I have searched quite extensively on Stack Overflow and elsewhere, and I can't seem to find an answer to the problem below.
I am trying to modify a parameter of a function that is itself passed as a parameter inside sklearn's GridSearchCV. More specifically, I want to set parameters (here preserve_case=False) of the casual_tokenize function that is passed as the tokenizer parameter of CountVectorizer.
Here's the specific code:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from nltk import casual_tokenize
Generating dummy data from 20newsgroup
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
twenty_train = fetch_20newsgroups(subset='train',categories=categories,shuffle=True,random_state=42)
Creating classification pipeline. Note that the tokenizer can be modified using lambda. I am wondering if there's another way to do it since it is not working with GridSearchCV.
text_clf = Pipeline([
    ('vect', CountVectorizer(tokenizer=lambda text: casual_tokenize(text, preserve_case=False))),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(twenty_train.data, twenty_train.target)  # this works fine
I then want to compare the default tokenizer of CountVectorizer with the one in nltk. Note that I am asking the question because I would like to compare more than one tokenizer, each of which has specific parameters that need to be specified.
parameters = {'vect': [CountVectorizer(), CountVectorizer(tokenizer=lambda text: casual_tokenize(text, preserve_case=False))]}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=5)
gs_clf = gs_clf.fit(twenty_train.data[:100], twenty_train.target[:100])
gs_clf.fit gives the following error: PicklingError: Can't pickle <function at 0x1138c5598>: attribute lookup on main failed
So my questions are:
- Does anybody know how to deal with this issue specifically with GridSearchCV?
- Is there a better, more Pythonic way of passing parameters to a function that is itself passed as a parameter?
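One direction that may be relevant to both questions, offered as a sketch rather than a verified fix: the standard pickle module serializes functions by their importable name, so a lambda is rejected, while a functools.partial wrapping a module-level function can usually be pickled. The example below uses only the standard library. simple_tokenize is a hypothetical stand-in for casual_tokenize and is_picklable is a hypothetical helper, both introduced purely for illustration.

```python
import pickle
from functools import partial


def simple_tokenize(text, preserve_case=True):
    """Hypothetical stand-in for nltk's casual_tokenize (whitespace split only)."""
    tokens = text.split()
    return tokens if preserve_case else [t.lower() for t in tokens]


# A lambda has no importable name, so the standard pickle module rejects it.
lambda_tokenizer = lambda text: simple_tokenize(text, preserve_case=False)

# functools.partial wraps a module-level function, which pickle looks up by name.
partial_tokenizer = partial(simple_tokenize, preserve_case=False)


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print(partial_tokenizer("Hello World"))
    print("lambda picklable: ", is_picklable(lambda_tokenizer))
    print("partial picklable:", is_picklable(partial_tokenizer))
```

Under the same assumption, the pipeline step could be written as CountVectorizer(tokenizer=partial(casual_tokenize, preserve_case=False)), and each tokenizer to compare would get its own partial in the 'vect' list of the parameter grid. Whether this removes the PicklingError with n_jobs=-1 depends on the installed sklearn and joblib versions, so it is worth checking against the setup above.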