Question 1

I just built my first random forest classifier today and I am trying to improve its performance. I read that cross-validation is important for avoiding overfitting and therefore for obtaining better results. I implemented StratifiedKFold using sklearn; surprisingly, however, this approach turned out to be less accurate. I have read numerous posts suggesting that cross-validation is much more effective than train_test_split.

Estimator:

rf = RandomForestClassifier(n_estimators=100, random_state=42)

K-Fold:

ss = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in ss.split(features, labels):
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]

TTS:

train_feature, test_feature, train_label, test_label = train_test_split(
    features, labels, train_size=0.8, test_size=0.2, random_state=42)

Below are results:

CV:

AUROC:  0.74
Accuracy Score:  74.74 %.
Specificity:  0.69
Precision:  0.75
Sensitivity:  0.79
Matthews correlation coefficient (MCC):  0.49
F1 Score:  0.77

TTS:

AUROC:  0.76
Accuracy Score:  76.23 %.
Specificity:  0.77
Precision:  0.79
Sensitivity:  0.76
Matthews correlation coefficient (MCC):  0.52
F1 Score:  0.77

Is this actually possible? Or have I wrongly set up my models?

Also, is this the correct way of using cross-validation?

Answer

Glad to see you have done your reading!

The difference arises because the TTS approach introduces bias: not all of your observations are used for testing.

In the validation approach, only a subset of the observations, those that are included in the training set rather than in the validation set, are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
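To get a feel for the "fewer observations" point in the quote, here is a rough sketch using sklearn's learning_curve. The data is synthetic (make_classification is only a stand-in for your features and labels, which I don't have), and the training sizes are arbitrary picks, so treat the numbers it prints as illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

# Assumption: synthetic binary data standing in for the asker's features/labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Fit the same model on increasingly large training subsets (absolute sizes)
# and score each fit on the held-out fold.
train_sizes, train_scores, test_scores = learning_curve(
    rf, X, y, train_sizes=[50, 150, 300, 450], cv=cv, scoring="accuracy"
)

for n, fold_scores in zip(train_sizes, test_scores):
    print(f"{n} training samples -> mean held-out accuracy {fold_scores.mean():.3f}")
```

If held-out accuracy keeps climbing with the training size, a model fit on only 80% of the data will usually look a bit worse than the same model fit on all of it.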

And the results can vary quite a lot:

the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set
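You can check this variability yourself by repeating train_test_split with different seeds and comparing the scores. This is a minimal sketch on synthetic data (again an assumption, since I don't have your dataset); the ten seeds and the smaller forest are arbitrary choices to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic binary data standing in for the asker's features/labels.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

holdout_scores = []
for seed in range(10):
    # Only the split changes between iterations; the model settings stay fixed.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.8, test_size=0.2, random_state=seed
    )
    rf = RandomForestClassifier(n_estimators=50, random_state=42)
    rf.fit(X_tr, y_tr)
    holdout_scores.append(accuracy_score(y_te, rf.predict(X_te)))

holdout_scores = np.array(holdout_scores)
print(f"min={holdout_scores.min():.3f}  max={holdout_scores.max():.3f}  "
      f"std={holdout_scores.std():.3f}")
```

A gap of a couple of percentage points between the best and worst split would be enough on its own to account for a difference like 74.74% vs 76.23%.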

Cross-validation deals with this problem by using all of the available data: every observation is used for testing exactly once, which greatly reduces this bias.
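As for whether the loop in the question is the right way to use cross-validation: if the loop only assigns the split variables, then after it finishes you are left with the last fold's split alone, so the model has to be fitted and scored inside the loop and the fold scores averaged. One way to do that without writing the loop by hand is cross_validate. The sketch below uses synthetic data in place of yours, and the metric names in the scoring dict are my own picks.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_validate

# Assumption: synthetic binary data standing in for the asker's features/labels.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# A fresh clone of rf is fitted on each training fold and scored on the
# matching test fold; every metric comes back as one value per fold.
scoring = {
    "accuracy": "accuracy",
    "roc_auc": "roc_auc",
    "f1": "f1",
    "mcc": make_scorer(matthews_corrcoef),
}
results = cross_validate(rf, X, y, cv=cv, scoring=scoring)

for name in scoring:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")

# Every observation lands in exactly one test fold.
tested = np.sort(np.concatenate([test for _, test in cv.split(X, y)]))
```

The mean over the ten folds is the number to compare against a single TTS score, and the standard deviation tells you how much a single split could move around.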

Here, your results for the TTS approach carry more bias, and this should be kept in mind when analysing them. You may also simply have been lucky with the test/validation set that was sampled.

Again, there is more on this topic in this great, beginner-friendly article: https://codesachin.wordpress.com/2015/08/30/cross-validation-and-the-bias-variance-tradeoff-for-dummies/

For a more in-depth source, refer to the "Model Assessment and Selection" chapter here (source of the quoted content):

https://web.stanford.edu/~hastie/Papers/ESLII.pdf

KFolds Cross Validation vs train_test_split
