I have found one other question like this (Speed-up nested cross-validation), but installing MPI does not work for me, even after trying several fixes suggested on this site and by Microsoft. I am hoping there is another package or approach that answers this question.
I want to compare multiple algorithms and grid-search a wide range of parameters (maybe too many?). Besides mpi4py, what ways are there to speed up my code? As I understand it, I cannot use n_jobs=-1, because the search would then no longer be nested. Is that right?
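To make the n_jobs question concrete, here is a minimal sketch on synthetic data (not the gene dataset) of nested cross-validation where only the outer loop is parallelised. As far as I understand scikit-learn, n_jobs only controls how many fits run at the same time and does not change which rows each fit sees, so the structure should stay nested. The data, grid and fold counts are placeholder assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic 4-class stand-in for the real data (assumption, illustration only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=6, n_classes=4, random_state=7
)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=7)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=7)

# Inner loop: hyperparameter search, kept serial (n_jobs=1).
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.5, 1, 5, 10]},
    cv=inner_cv,
    n_jobs=1,
)

# Outer loop: each outer fold gets its own clone of `search`, fitted only on
# that fold's training rows. n_jobs=-1 runs the outer folds in parallel.
nested_scores = cross_val_score(
    search, X_demo, y_demo, cv=outer_cv, scoring="accuracy", n_jobs=-1
)
print("Nested CV accuracy: %.3f (+/- %.3f)" % (nested_scores.mean(), nested_scores.std()))
```

Parallelising the inner search instead (n_jobs=-1 on GridSearchCV, n_jobs=1 on cross_val_score) should be equally valid; requesting all cores at both levels at once may oversubscribe the machine.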
Note that I have not been able to run this on the full set of parameters shown below (it runs longer than I have time for). I only get results, after 2 hours, if I give each model just 2 parameters to compare. The dataset has 252 rows (one per gene) and 25 feature columns, and the target is a categorical variable with 4 classes ('certain', 'likely', 'possible', or 'unknown') describing whether a gene affects a disease. SMOTE increases the sample size to 420, which is what goes into the models.
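# Imports inferred from the names used below; the Keras import paths are an assumption.
import pandas as pd
from imblearn.over_sampling import SMOTE
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn import model_selection, preprocessing
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

dataset = pd.read_csv('data.csv')
data = dataset.drop(["gene"], axis=1)
df = data.iloc[:, 0:24]
df = df.fillna(0)
X = MinMaxScaler().fit_transform(df)

le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["certain", "likely", "possible", "unlikely"])
Y = le.fit_transform(data["category"])

sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)

seed = 7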
logreg = LogisticRegression(penalty='l1', solver='liblinear', multi_class='auto')
LR_par = {'penalty': ['l1'], 'C': [0.5, 1, 5, 10], 'max_iter': [500, 1000, 5000]}

rfc = RandomForestClassifier()
param_grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4, 25],
    'min_samples_split': [2, 5, 10, 25],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}

mlp = MLPClassifier(random_state=seed)
parameter_space = {
    'hidden_layer_sizes': [(10, 20), (10, 20, 10), (50,)],
    'activation': ['tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'max_iter': [10000],
    'alpha': [0.1, 0.01, 0.001],
    'learning_rate': ['constant', 'adaptive'],
}

gbm = GradientBoostingClassifier(min_samples_split=25, min_samples_leaf=25)
param = {
    "loss": ["deviance"],
    "learning_rate": [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
    "min_samples_split": [2, 5, 10, 25],
    "min_samples_leaf": [1, 2, 4, 25],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    "max_features": ['auto', 'sqrt'],
    "criterion": ["friedman_mse"],
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}

svm = SVC(gamma="scale", probability=True)
tuned_parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 0.25, 0.5, 0.75)}

def baseline_model(optimizer='adam', learn_rate=0.01):
    # Note: learn_rate is accepted so the grid can pass it, but it is not used below.
    model = Sequential()
    model.add(Dense(100, input_dim=X_res.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu'))  # 50 hidden units
    model.add(Dense(4, activation='softmax'))  # one output per class
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

keras = KerasClassifier(build_fn=baseline_model, batch_size=32, epochs=100, verbose=0)
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
kerasparams = dict(optimizer=optimizer, learn_rate=learn_rate)

inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)

models = []
models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv,iid=False, n_jobs=1)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv,iid=False, n_jobs=1)))
models.append(('LR', GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('MLP', GridSearchCV(mlp, parameter_space, cv=inner_cv,iid=False, n_jobs=1)))
models.append(('Keras', GridSearchCV(estimator=keras, param_grid=kerasparams, cv=inner_cv, iid=False, n_jobs=1)))

results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)

for name, model in models:
    nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
    print(msg)
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
    print("Best Parameters: \n{}\n".format(model.best_params_))
    print("Best CV Score: \n{}\n".format(model.best_score_))
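To put a number on "maybe too many parameters", the size of the search can be estimated from the grids alone. A rough sketch for the random forest grid above, using only the standard library; it assumes GridSearchCV's default refit=True, which adds one extra fit per outer fold.

```python
from math import prod

# Random forest grid copied from the code above.
rf_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    "max_features": ["auto", "sqrt"],
    "min_samples_leaf": [1, 2, 4, 25],
    "min_samples_split": [2, 5, 10, 25],
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}

inner_splits = 10
outer_splits = 10

# Number of parameter combinations the grid search will try.
n_candidates = prod(len(values) for values in rf_grid.values())

# Each outer fold runs a full grid search (candidates x inner folds)
# plus one refit of the best candidate on the outer training data.
fits_per_outer_fold = n_candidates * inner_splits + 1
total_fits = fits_per_outer_fold * outer_splits

print(n_candidates)  # 7040
print(total_fits)    # 704010
```

So the random forest alone needs roughly 700,000 forest fits, many of them with over 1000 trees, before the other five models are counted.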
As an example, most of the dataset is binary and looks like this:
gene  Tissue  Druggable  Eigenvalue  CADDvalue  Catalogpresence  Category
ACE   1       1          1           0          1                Certain
ABO   1       0          0           0          0                Likely
TP53  1       1          0           0          0                Possible
Any guidance on how I could speed this up would be appreciated.
Edit: I have also tried using parallel processing with dask, but I am not sure I am doing it right, and it doesn't seem to run any faster:
for name, model in models:
    with joblib.parallel_backend('dask'):
        nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
    print(msg)
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
    #print("Best Estimator: \n{}\n".format(model.best_estimator_))
    print("Best Parameters: \n{}\n".format(model.best_params_))
    print("Best CV Score: \n{}\n".format(model.best_score_))  # average of all cv folds for a single combination of the parameters you specify
Edit: regarding reducing the grid search, I have tried, for example, 5 parameters per model, but this still takes several hours to complete. So while trimming the grid will help, I would be grateful for any advice on efficiency beyond that.
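One direction beyond trimming the grid by hand, sketched here under assumptions rather than as a tested fix: scikit-learn's RandomizedSearchCV samples a fixed number of candidates (n_iter) from the same parameter lists instead of trying every combination, so the cost is set by n_iter rather than by the grid size. The synthetic data, the shortened lists and n_iter=4 below are placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score

# Synthetic 4-class stand-in for the real data (assumption, illustration only).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=10, n_informative=6, n_classes=4, random_state=7
)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=7)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=7)

# Shortened, hypothetical version of the random forest grid (54 combinations).
rf_space = {
    "max_depth": [10, 20, None],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [20, 50],
}

# Only n_iter=4 of the 54 combinations are sampled and evaluated per search.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=7),
    param_distributions=rf_space,
    n_iter=4,
    cv=inner_cv,
    random_state=7,
    n_jobs=1,
)

# Same nested structure as before: the search is the estimator of the outer loop.
randomized_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring="accuracy")
print("Nested CV accuracy: %.3f (+/- %.3f)" % (randomized_scores.mean(), randomized_scores.std()))
```

The trade-off is that the best combination in the full grid may never be sampled, so n_iter would need to be chosen with the available time in mind.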