KFolds Cross Validation vs train_test_split

2024/10/10 14:22:45

I just built my first random forest classifier today and I am trying to improve its performance. I read that cross-validation is important for avoiding overfitting and hence obtaining better results. I implemented StratifiedKFold using sklearn; surprisingly, however, this approach turned out to be less accurate. I have read numerous posts suggesting that cross-validation is much more efficient than train_test_split.

Estimator:

rf = RandomForestClassifier(n_estimators=100, random_state=42)

K-Fold:

ss = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in ss.split(features, labels):
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]

TTS:

train_feature, test_feature, train_label, test_label = train_test_split(
    features, labels, train_size=0.8, test_size=0.2, random_state=42)

Below are results:

CV:

AUROC:  0.74
Accuracy Score:  74.74 %.
Specificity:  0.69
Precision:  0.75
Sensitivity:  0.79
Matthews correlation coefficient (MCC):  0.49
F1 Score:  0.77

TTS:

AUROC:  0.76
Accuracy Score:  76.23 %.
Specificity:  0.77
Precision:  0.79
Sensitivity:  0.76
Matthews correlation coefficient (MCC):  0.52
F1 Score:  0.77

Is this actually possible? Or have I wrongly set up my models?

Also, is this the correct way of using cross-validation?

Answer

Glad to see you did your research!

The difference arises because the train_test_split (TTS) approach introduces bias: you are not using all of your observations for testing.

"In the validation approach, only a subset of the observations (those that are included in the training set rather than in the validation set) are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set."

And the results can vary quite a lot:

"The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set."
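To give a feel for that variability, here is a minimal standard-library sketch. Everything in it is made up for illustration: the 1000 observations, the roughly 75% hit rate, and the helper name `holdout_accuracy` are assumptions, not your data or model. It keeps the classifier's per-observation outcomes fixed and only changes which 20% of observations land in the test set, so it shows just one source of variation (a real split also changes the training set).

```python
import random
import statistics

# Hypothetical setup: 1000 observations, each with a fixed record of whether
# a made-up classifier predicts it correctly (about 75% correct overall).
rng = random.Random(0)
correct = [rng.random() < 0.75 for _ in range(1000)]


def holdout_accuracy(seed, test_fraction=0.2):
    """Accuracy measured on a single randomly drawn test set."""
    split_rng = random.Random(seed)
    n_test = int(len(correct) * test_fraction)
    test_idx = split_rng.sample(range(len(correct)), n_test)
    return sum(correct[i] for i in test_idx) / n_test


# Repeat the 80/20 split with 50 different seeds and compare the estimates.
estimates = [holdout_accuracy(seed) for seed in range(50)]
overall = sum(correct) / len(correct)

print(f"accuracy over all observations: {overall:.3f}")
print(f"hold-out estimates: min={min(estimates):.3f}, "
      f"max={max(estimates):.3f}, stdev={statistics.stdev(estimates):.3f}")
```

The spread between the smallest and largest estimate comes purely from the choice of test set, which is why a gap of a couple of percentage points between two evaluation schemes is not necessarily meaningful.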

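Cross-validation deals with this problem by using all of the available data: every observation is used for testing exactly once and for training in the remaining folds, which reduces the bias and makes the estimate less dependent on a single split.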
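As for whether the loop in the question is the right way to use cross-validation: if the model is fitted and scored only after the loop has finished, just the last fold's split is evaluated. A common pattern is to fit and score on every fold and then average. Below is a minimal sketch using sklearn's cross_validate; the synthetic data from make_classification is only a stand-in for your features and labels, and the choice of metrics is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the question's `features` and `labels` (assumption).
features, labels = make_classification(
    n_samples=500, n_features=20, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# cross_validate fits a fresh clone of the estimator on each training fold
# and scores it on the matching test fold, returning one score per fold.
results = cross_validate(
    rf, features, labels, cv=cv, scoring=["accuracy", "roc_auc", "f1"])

for metric in ("accuracy", "roc_auc", "f1"):
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: mean={fold_scores.mean():.3f}, "
          f"std={fold_scores.std():.3f}")
```

Reporting the standard deviation next to the mean shows how much the score moves from fold to fold, which helps when comparing against a single train_test_split result.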

Your results for the TTS approach therefore carry more bias, and this should be kept in mind when analysing them. You may also simply have been lucky with the particular test/validation set that was sampled.

There is more on this topic in this great, beginner-friendly article: https://codesachin.wordpress.com/2015/08/30/cross-validation-and-the-bias-variance-tradeoff-for-dummies/

For a more in-depth source, refer to the "Model Assessment and Selection" chapter here (source of the quoted content):

https://web.stanford.edu/~hastie/Papers/ESLII.pdf
