How to speed up nested cross-validation in Python?

2024/4/14 9:52:57

From what I've found, there is one other question like this (Speed-up nested cross-validation). However, installing MPI does not work for me, even after trying several fixes suggested on this site and by Microsoft, so I am hoping there is another package or answer to this question.

I am looking to compare multiple algorithms and grid-search a wide range of parameters (maybe too many?). What ways are there, besides mpi4py, to speed up my code? As I understand it, I cannot use n_jobs=-1, as the cross-validation would then no longer be nested?

Also to note, I have not been able to run this on the full set of parameters shown below (it runs longer than I have time for). I only get results after 2 hours if I give each model just 2 parameters to compare. I run this code on a dataset of 252 rows (one per gene) and 25 feature columns, predicting one of 4 categories ('certain', 'likely', 'possible', or 'unknown') for whether a gene affects a disease. Using SMOTE increases the sample size to 420, which is what goes into the models.

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn import model_selection, preprocessing
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasClassifier  # legacy wrapper that accepts build_fn

# Load the data, drop the gene identifier and scale the features to [0, 1]
dataset = pd.read_csv('data.csv')
data = dataset.drop(["gene"], axis=1)
df = data.iloc[:, 0:24]
df = df.fillna(0)
X = MinMaxScaler().fit_transform(df)

# Encode the class labels as integers
le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["certain", "likely", "possible", "unlikely"])
Y = le.fit_transform(data["category"])

# Oversample the minority classes
sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)

seed = 7

logreg = LogisticRegression(penalty='l1', solver='liblinear', multi_class='auto')
LR_par = {'penalty': ['l1'], 'C': [0.5, 1, 5, 10], 'max_iter': [500, 1000, 5000]}

rfc = RandomForestClassifier()
param_grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4, 25],
    'min_samples_split': [2, 5, 10, 25],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}

mlp = MLPClassifier(random_state=seed)
parameter_space = {
    'hidden_layer_sizes': [(10, 20), (10, 20, 10), (50,)],
    'activation': ['tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'max_iter': [10000],
    'alpha': [0.1, 0.01, 0.001],
    'learning_rate': ['constant', 'adaptive'],
}

gbm = GradientBoostingClassifier(min_samples_split=25, min_samples_leaf=25)
param = {
    "loss": ["deviance"],
    "learning_rate": [0.15, 0.1, 0.05, 0.01, 0.005, 0.001],
    "min_samples_split": [2, 5, 10, 25],
    "min_samples_leaf": [1, 2, 4, 25],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    "max_features": ['auto', 'sqrt'],
    "criterion": ["friedman_mse"],
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}

svm = SVC(gamma="scale", probability=True)
tuned_parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 0.25, 0.5, 0.75)}

def baseline_model(optimizer='adam', learn_rate=0.01):
    model = Sequential()
    model.add(Dense(100, input_dim=X_res.shape[1], activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu'))  # 8 is the dim / the number of hidden units (units are the kernel)
    model.add(Dense(4, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

keras = KerasClassifier(build_fn=baseline_model, batch_size=32, epochs=100, verbose=0)
learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
kerasparams = dict(optimizer=optimizer, learn_rate=learn_rate)

inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)

# Each entry wraps a model in an inner-loop grid search
models = []
models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('LR', GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('MLP', GridSearchCV(mlp, parameter_space, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('Keras', GridSearchCV(estimator=keras, param_grid=kerasparams, cv=inner_cv, iid=False, n_jobs=1)))

results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)

for name, model in models:
    # Outer loop: every outer fold runs the full inner grid search
    nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
    print(msg)
    # Fit the grid search on the training split so that score() and best_params_ are available
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
    print("Best Parameters: \n{}\n".format(model.best_params_))
    print("Best CV Score: \n{}\n".format(model.best_score_))

As an example, most of the dataset is binary and looks like this:

gene   Tissue    Druggable Eigenvalue CADDvalue Catalogpresence   Category
ACE      1           1         1          0           1            Certain
ABO      1           0         0          0           0            Likely
TP53     1           1         0          0           0            Possible

Any guidance on how I could speed this up would be appreciated.

Edit: I have also tried using parallel processing with dask, but I am not sure I am doing it right, and it doesn't seem to run any faster:

import joblib

for name, model in models:
    with joblib.parallel_backend('dask'):
        nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
        results.append(nested_cv_results)
        names.append(name)
        msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
        print(msg)
        model.fit(X_train, Y_train)
        print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')
        #print("Best Estimator: \n{}\n".format(model.best_estimator_))
        print("Best Parameters: \n{}\n".format(model.best_params_))
        print("Best CV Score: \n{}\n".format(model.best_score_))  # average of all cv folds for a single combination of the parameters you specify

Edit: regarding reducing the grid search, I have tried, for example, 5 parameters per model, but this still takes several hours to complete. So whilst trimming down the number will be helpful, I would be grateful for any advice on efficiency beyond that.
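
To put the grid sizes above in perspective, the number of model fits can be estimated with simple arithmetic: each outer fold runs a full grid search (candidates times inner folds) plus one refit of the best candidate, which is the GridSearchCV default. A rough sketch, with the per-parameter value counts copied from the grids in the question; the helper name nested_fits is my own and purely illustrative:

```python
from math import prod

# Number of values tried per hyperparameter, counted from the grids in the question.
grid_sizes = {
    "GBM":   [1, 6, 4, 4, 11, 2, 1, 10],
    "RFC":   [2, 11, 2, 4, 4, 10],
    "LR":    [1, 4, 3],
    "SVM":   [2, 4],
    "MLP":   [3, 2, 2, 1, 3, 2],
    "Keras": [7, 5],
}

INNER_FOLDS = 10  # inner_cv in the question
OUTER_FOLDS = 10  # outer_cv in the question


def nested_fits(sizes, inner=INNER_FOLDS, outer=OUTER_FOLDS):
    """Fits for one model: per outer fold, every candidate on every inner fold, plus one refit."""
    candidates = prod(sizes)
    return outer * (candidates * inner + 1)


fits = {name: nested_fits(sizes) for name, sizes in grid_sizes.items()}
total_fits = sum(fits.values())

for name, n in fits.items():
    print(f"{name:5s} {n:>9,d} fits")
print(f"total {total_fits:>9,d} fits")
```

By this count the GBM and RFC grids account for almost all of the work (roughly 2.1 million and 0.7 million fits respectively, many of them with over a thousand trees), so shrinking those two grids, or sampling them with RandomizedSearchCV and a fixed n_iter, is likely to matter more than anything done to the other models.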


Dask-ML has scalable implementations of GridSearchCV and RandomizedSearchCV that are, I believe, drop-in replacements for their Scikit-Learn counterparts. They were developed alongside Scikit-Learn developers.


They can be faster for two reasons:

  • They avoid repeating shared work between different stages of a Pipeline
  • They can scale out to a cluster anywhere you can deploy Dask (which is easy on most cluster infrastructure)
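
On the n_jobs concern raised in the question: as far as I know, n_jobs only controls how many workers run the fits and does not change which splits are used, so the evaluation stays nested as long as the GridSearchCV object itself is what gets passed to cross_val_score. A minimal sketch with plain Scikit-Learn, assuming a synthetic stand-in dataset (make_classification) and a deliberately tiny grid and fold count so that it finishes quickly:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for the real data (assumption: 252 rows, 25 features, 4 classes).
X, y = make_classification(n_samples=252, n_features=25, n_informative=8,
                           n_classes=4, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=7)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=7)

# Inner loop: hyperparameter search, parallelised across candidate fits.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.5, 1, 5, 10]},
    cv=inner_cv,
    scoring="accuracy",
    n_jobs=-1,
)

# Outer loop: each outer training fold runs its own independent search,
# so the evaluation is still nested. Only one level is parallelised here
# to avoid oversubscribing the CPU.
nested_scores = cross_val_score(search, X, y, cv=outer_cv,
                                scoring="accuracy", n_jobs=1)

print("Nested CV accuracy: %.3f (+/- %.3f)"
      % (nested_scores.mean(), nested_scores.std()))
```

If Dask-ML is installed, the suggestion above would amount to importing GridSearchCV from dask_ml.model_selection instead of sklearn.model_selection; I have not checked that swap against this exact setup, so treat it as something to try, not a guarantee.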
