I use `GridSearchCV` to find the best parameters in the inner loop of my nested cross-validation. The 'inner winner' is found using `GridSearchCV(scoring='balanced_accuracy')`, so as I understand the documentation, the model with the highest average balanced accuracy across the inner folds is the `best_estimator_`. I don't understand what the different arguments for `refit` in `GridSearchCV` do in combination with the `scoring` argument. If `refit` is `True`, which scoring function will be used to estimate the performance of that 'inner winner' when it is refitted to the dataset? The same scoring function that was passed to `scoring` (so in my case `'balanced_accuracy'`)? Why can you also pass a string to `refit`? Does that mean you can use different functions for 1) finding the 'inner winner' and 2) estimating the performance of that 'inner winner' on the whole dataset?
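
For reference, a minimal sketch of the setup I mean (the `SVC` estimator, the parameter grid, and the data are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: GridSearchCV picks the 'inner winner' by balanced accuracy.
inner = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10]},
    scoring="balanced_accuracy",
    refit=True,  # refit the winner on the whole inner training set
    cv=3,
)

# Outer loop: estimates the performance of the refitted inner winners.
outer_scores = cross_val_score(inner, X, y, scoring="balanced_accuracy", cv=5)
print(outer_scores.mean())
```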
When `refit=True`, sklearn refits the model on the entire training set. So there is no test data left over to estimate its performance with any scoring function.
If you use multiple scorers in `GridSearchCV`, say `f1_score` or `precision` along with your `balanced_accuracy`, sklearn needs to know which one of them to use to find the "inner winner", as you call it. For example, with KNN, `f1_score` might be best at `K=5`, while accuracy might be highest at `K=10`. There is no way for sklearn to know which value of the hyperparameter `K` is the best.
To resolve that, you can pass a single scorer name as a string to `refit` to specify which of those scorers should ultimately decide the best hyperparameters. That best value is then used to retrain, or refit, the model on the full dataset. So when you have just one scorer, as seems to be your case, you don't have to worry about this: `refit=True` will suffice.
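
A short sketch of the multi-metric case (the KNN estimator and grid are illustrative). With more than one scorer, leaving `refit=True` raises an error at fit time, so you name the deciding scorer explicitly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 10]},
    scoring={"bal_acc": "balanced_accuracy", "f1": "f1"},
    refit="bal_acc",  # this scorer picks the best params; refit=True would be ambiguous here
    cv=5,
)
search.fit(X, y)

print(search.best_params_)                  # chosen by balanced accuracy
print(search.cv_results_["mean_test_f1"])   # f1 results are still reported per candidate
```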