Consider that I have 2 classes of data and I am using sklearn for classification:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

def cv_classif_wrapper(classifier, X, y, n_splits=5, random_state=42, verbose=0):
    '''Cross-validation wrapper.'''
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = cross_validate(classifier, X, y, cv=cv,
                            scoring=['f1_weighted', 'accuracy',
                                     'recall_weighted', 'precision_weighted'])
    if verbose:
        print("=====================")
        print(f"Accuracy: {scores['test_accuracy'].mean():.3f} (+/- {scores['test_accuracy'].std()*2:.3f})")
        print(f"Recall: {scores['test_recall_weighted'].mean():.3f} (+/- {scores['test_recall_weighted'].std()*2:.3f})")
        print(f"Precision: {scores['test_precision_weighted'].mean():.3f} (+/- {scores['test_precision_weighted'].std()*2:.3f})")
        print(f"F1: {scores['test_f1_weighted'].mean():.3f} (+/- {scores['test_f1_weighted'].std()*2:.3f})")
    return scores
and I call it by
scores = cv_classif_wrapper(LogisticRegression(), Xs, y0, n_splits=5, verbose=1)
Then I calculate the confusion matrix with this:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
import sklearn.metrics

model = LogisticRegression(random_state=42)
y_pred = cross_val_predict(model, Xs, y0, cv=5)
cm = sklearn.metrics.confusion_matrix(y0, y_pred)
The question is: I am getting 0.95 for the F1 score, but the confusion matrix is
Is this consistent with an F1 score of 0.95? Where is the mistake, if there is one?
Note that there are 35 subjects in class 0 and 364 in class 1.
Accuracy: 0.952 (+/- 0.051)
Recall: 0.952 (+/- 0.051)
Precision: 0.948 (+/- 0.062)
F1: 0.947 (+/- 0.059)
Your data is imbalanced, i.e. the target classes are not equally distributed. As niid pointed out in their answer, the F1 score you are computing is the weighted F1 score, which can be misleading if not interpreted carefully, especially if your classes are not equally important. Think of customer churn or e-mail spam classification: your model can be 99% correct (or have a very high F1 score) and still be useless.
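One quick way to see whether the weighted score hides poor minority-class performance is to look at per-class metrics instead of the weighted average. A minimal sketch, reusing the Xs and y0 from your question:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, f1_score

# Same cross-validated predictions as in the question (Xs, y0 assumed to exist)
model = LogisticRegression(random_state=42)
y_pred = cross_val_predict(model, Xs, y0, cv=5)

# average=None returns one F1 score per class instead of a weighted mean
print(f1_score(y0, y_pred, average=None))

# Full per-class breakdown: precision, recall, F1 and support
print(classification_report(y0, y_pred))

With 35 subjects in class 0 and 364 in class 1, the weighted averages are dominated by class 1, so the class-0 row of this report is the one to check against your confusion matrix.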
We usually calculate metrics to compare different models with each other. For this, the area under the ROC curve (AUC-ROC) is often used. It summarizes the ROC curve, which plots the true-positive rate against the false-positive rate for different thresholds. It is therefore independent of the threshold you choose, unlike accuracy, precision, recall and F1 score, which all depend on it.
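A minimal sketch of how this fits into your existing cross-validation setup, again assuming the Xs and y0 from your question:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# The 'roc_auc' scorer uses the classifier's decision scores, so no threshold is involved
auc = cross_val_score(LogisticRegression(random_state=42), Xs, y0,
                      cv=cv, scoring='roc_auc')
print(f"AUC-ROC: {auc.mean():.3f} (+/- {auc.std()*2:.3f})")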
In the case of imbalanced data, the area under the precision-recall curve (AUC-PR) is even more suitable for comparing different classifiers:
- AUC-ROC is less informative for imbalanced data: the false-positive rate is normalized by the (large) number of majority-class samples, so many false positives still produce a small false-positive rate and the curve changes little even when the minority class is handled poorly.
- AUC-PR focuses on precision and recall, which are more sensitive to minority-class performance and better suited to imbalanced datasets.
- AUC-PR therefore provides a more realistic assessment of classifier performance in imbalanced scenarios.
Consequently, you might want to rethink your metrics.
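Note that scikit-learn's 'average_precision' scorer (the usual estimate of AUC-PR) treats label 1 as the positive class, while your minority class is class 0. A minimal sketch that relabels accordingly, assuming the Xs and y0 from your question:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Relabel so the minority class (class 0, 35 subjects) becomes the positive class,
# because the 'average_precision' scorer uses label 1 as positive by default
y_minority = (np.asarray(y0) == 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
ap = cross_val_score(LogisticRegression(random_state=42), Xs, y_minority,
                     cv=cv, scoring='average_precision')
print(f"AUC-PR (minority class): {ap.mean():.3f} (+/- {ap.std()*2:.3f})")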
Additionally, a plain LogisticRegression() is not the best choice for imbalanced data, as it can be biased towards the majority class and perform poorly on the minority class. You might consider strategies for handling class imbalance, such as class weighting or resampling; a sketch of class weighting follows below.
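As one possible mitigation (a minimal sketch, not necessarily the best option for your data), class_weight='balanced' reweights samples inversely to class frequency, again assuming the Xs and y0 from your question:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix

# Reweight samples inversely proportional to class frequencies
balanced = LogisticRegression(random_state=42, class_weight='balanced')
y_pred = cross_val_predict(balanced, Xs, y0, cv=5)

print(confusion_matrix(y0, y_pred))
print(classification_report(y0, y_pred))

Compare this confusion matrix and report with your original ones; the class-0 row is where any difference should show up.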