I just built my first random forest classifier today and I am trying to improve its performance. I had read that cross-validation is important for avoiding overfitting and therefore gives more reliable results, so I implemented StratifiedKFold with sklearn. Surprisingly, though, this approach turned out to be less accurate than a plain train_test_split, even though numerous posts suggest that cross-validation is much more reliable than a single train/test split.
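(For what it's worth, I understand the same 10-fold evaluation can also be done in one call with sklearn's cross_val_score; the sketch below is only to illustrate the pattern I'm aiming for, not my actual code.)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Sketch: 10-fold stratified CV accuracy in one call.
# 'features' and 'labels' stand in for my real arrays.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(rf, features, labels, cv=cv, scoring='accuracy')
print(scores.mean(), scores.std())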
Estimator:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
K-Fold:
from sklearn.model_selection import StratifiedKFold

ss = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in ss.split(features, labels):
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]
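Inside the loop I fit on the training fold, predict on the held-out fold, and average the metrics over the 10 folds, roughly following this pattern (a sketch, with accuracy standing in for the full metric set):

import numpy as np
from sklearn.metrics import accuracy_score

# Sketch: per-fold fit/score, then average across folds.
fold_accuracies = []
for train_index, test_index in ss.split(features, labels):
    rf.fit(features[train_index], labels[train_index])
    preds = rf.predict(features[test_index])
    fold_accuracies.append(accuracy_score(labels[test_index], preds))
print(np.mean(fold_accuracies))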
TTS:
from sklearn.model_selection import train_test_split

train_feature, test_feature, train_label, test_label = \
    train_test_split(features, labels, train_size=0.8, test_size=0.2, random_state=42)
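For the hold-out split the evaluation takes the same form (again a sketch):

from sklearn.metrics import accuracy_score

# Sketch: single fit on the 80% split, scored on the 20% hold-out.
rf.fit(train_feature, train_label)
predictions = rf.predict(test_feature)
print(accuracy_score(test_label, predictions))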
Below are the results:
CV:
AUROC: 0.74
Accuracy Score: 74.74%
Specificity: 0.69
Precision: 0.75
Sensitivity: 0.79
Matthews correlation coefficient (MCC): 0.49
F1 Score: 0.77
TTS:
AUROC: 0.76
Accuracy Score: 76.23%
Specificity: 0.77
Precision: 0.79
Sensitivity: 0.76
Matthews correlation coefficient (MCC): 0.52
F1 Score: 0.77
Is this actually possible, or have I set up my models incorrectly?
Also, is this the correct way to use cross-validation?