Evaluate multiple scores on sklearn cross_val_score

2024/11/20 8:27:35

I'm trying to evaluate multiple machine learning algorithms with sklearn for a couple of metrics (accuracy, recall, precision and maybe more).

From what I understand of the documentation and the source code (I'm using sklearn 0.17), the cross_val_score function accepts only one scorer per call. So to calculate multiple scores, I have to either:

  1. Execute it multiple times
  2. Implement my own (time-consuming and error-prone) scorer

I've executed it multiple times with this code:

    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cross_validation import cross_val_score
    import time
    from sklearn.datasets import load_iris

    iris = load_iris()

    models = [GaussianNB(), DecisionTreeClassifier(), SVC()]
    names = ["Naive Bayes", "Decision Tree", "SVM"]
    for model, name in zip(models, names):
        print name
        start = time.time()
        for score in ["accuracy", "precision", "recall"]:
            print score,
            print " : ",
            print cross_val_score(model, iris.data, iris.target, scoring=score, cv=10).mean()
        print time.time() - start
    

And I get this output:

Naive Bayes
accuracy  :  0.953333333333
precision  :  0.962698412698
recall  :  0.953333333333
0.0383198261261
Decision Tree
accuracy  :  0.953333333333
precision  :  0.958888888889
recall  :  0.953333333333
0.0494720935822
SVM
accuracy  :  0.98
precision  :  0.983333333333
recall  :  0.98
0.063080072403

This works, but it's slow on my own data. How can I measure all the scores at once?

Answer

Since this post was written, scikit-learn has been updated and my original answer is now obsolete; see the much cleaner solution below.


You can write your own scoring function to capture all three pieces of information; however, a scoring function for cross-validation must return only a single number in scikit-learn (likely for compatibility reasons). Below is an example where the scores for each cross-validation fold are printed to the console, and the returned value is just the sum of the three metrics. If you want to return all these values, you will have to make some changes to cross_val_score (line 1351 of cross_validation.py) and _score (line 1601 of the same file).

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score
import time
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, precision_score, recall_score

iris = load_iris()

models = [GaussianNB(), DecisionTreeClassifier(), SVC()]
names = ["Naive Bayes", "Decision Tree", "SVM"]

def getScores(estimator, x, y):
    yPred = estimator.predict(x)
    return (accuracy_score(y, yPred),
            precision_score(y, yPred, pos_label=3, average='macro'),
            recall_score(y, yPred, pos_label=3, average='macro'))

def my_scorer(estimator, x, y):
    a, p, r = getScores(estimator, x, y)
    print a, p, r
    return a+p+r

for model, name in zip(models, names):
    print name
    start = time.time()
    m = cross_val_score(model, iris.data, iris.target, scoring=my_scorer, cv=10).mean()
    print '\nSum:', m, '\n\n'
    print 'time', time.time() - start, '\n\n'

Which gives:

Naive Bayes
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.866666666667 0.904761904762 0.866666666667
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.86936507937

time 0.0249638557434

Decision Tree
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
0.866666666667 0.866666666667 0.866666666667
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.86555555556

time 0.0237860679626

SVM
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
0.933333333333 0.944444444444 0.933333333333
0.933333333333 0.944444444444 0.933333333333
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0

Sum: 2.94333333333

time 0.043044090271

As of scikit-learn 0.19.0 the solution becomes much easier:

from sklearn.model_selection import cross_validate
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()
# Keys are arbitrary labels for the results; note that the key
# 'rec_micro' is actually mapped to the 'recall_macro' scorer.
scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           'rec_micro': 'recall_macro'}
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                        cv=5, return_train_score=True)
print(scores.keys())
print(scores['test_acc'])

Which gives:

['test_acc', 'score_time', 'train_acc', 'fit_time', 'test_rec_micro', 'train_rec_micro', 'train_prec_macro', 'test_prec_macro']
[ 0.96666667  1.          0.96666667  0.96666667  1.        ]
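
To tie this back to the original question (several models, several metrics), here is a minimal sketch, assuming Python 3 and scikit-learn 0.19 or later. The summarize helper and the layout of the returned dictionary are my own illustrative choices, not part of scikit-learn. The scoring argument is passed as a list of metric names, which cross_validate also accepts; the result keys are then 'test_' plus the metric name. Each model is fit once per fold, and all metrics are computed from that single fit.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Same three models as in the question; random_state is fixed only
# so that the decision tree gives repeatable results.
models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

# Macro-averaged variants are used because iris is a multiclass problem.
scoring = ["accuracy", "precision_macro", "recall_macro"]


def summarize(models, X, y, scoring, cv=10):
    """Hypothetical helper: return {model name: {metric: mean test score}}."""
    summary = {}
    for name, model in models.items():
        # One cross-validation run per model evaluates every metric.
        scores = cross_validate(model, X, y, scoring=scoring, cv=cv)
        summary[name] = {
            metric: float(np.mean(scores["test_" + metric]))
            for metric in scoring
        }
    return summary


summary = summarize(models, iris.data, iris.target, scoring)
for name, means in summary.items():
    print(name)
    for metric, value in means.items():
        print("  {} : {:.4f}".format(metric, value))
```

The per-fold arrays are still available from cross_validate if you need more than the mean (for example, the standard deviation across folds), along with fit_time and score_time for each fold.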