GridSearchCV: can't pickle function error when trying to pass a lambda as a parameter

2024/10/14 5:17:44

I have looked quite extensively on Stack Overflow and elsewhere, and I can't seem to find an answer to the problem below.

I am trying to modify a parameter of a function that is itself passed as a parameter inside sklearn's GridSearchCV. More specifically, I want to change a parameter (here preserve_case=False) of the casual_tokenize function that is passed to the tokenizer argument of CountVectorizer.

Here's the specific code:

from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from nltk import casual_tokenize

Generating dummy data from 20newsgroups:

categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

Creating the classification pipeline.
Note that the tokenizer can be modified using a lambda. I am wondering if there's another way to do it, since the lambda does not work with GridSearchCV.

text_clf = Pipeline([
    ('vect', CountVectorizer(tokenizer=lambda text: casual_tokenize(text, preserve_case=False))),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(twenty_train.data, twenty_train.target)  # this works fine

I then want to compare the default tokenizer of CountVectorizer with the one in nltk. Note that I am asking the question because I would like to compare more than one tokenizer, each of which has specific parameters that need to be specified.

parameters = {'vect': [CountVectorizer(),
                       CountVectorizer(tokenizer=lambda text: casual_tokenize(text, preserve_case=False))]}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=5)
gs_clf = gs_clf.fit(twenty_train.data[:100], twenty_train.target[:100])

gs_clf.fit gives the following error: PicklingError: Can't pickle <function <lambda> at 0x1138c5598>: attribute lookup <lambda> on __main__ failed

So my questions are:

  1. Does anybody know how to deal with this issue, specifically with GridSearchCV?
  2. Is there a better, pythonic way of dealing with passing parameters to a function that will itself be a parameter?
Answer

1) Does anybody know how to deal with this issue specifically with GridSearchCV?

You can use partial instead of a lambda:

from functools import partial
from sklearn.externals.joblib import dump

def add(a, b):
    return a + b

plus_one = partial(add, b=1)
plus_one_lambda = lambda a: a + 1
dump(plus_one, 'add.pkl')          # No problem
dump(plus_one_lambda, 'add.pkl')   # Pickling error
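
The difference is that pickle serializes a function by looking it up by module and name: the named function add can be found again in the worker process, and a partial object only stores that reference plus its bound arguments, whereas a lambda has no importable name, which is exactly the "attribute lookup ... on __main__ failed" error above.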

For your case:

tokenizer=partial(casual_tokenize, preserve_case=False)
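
For example, here is a minimal sketch of how that drops into the grid from the question (it reuses text_clf, twenty_train and the CountVectorizer/GridSearchCV settings from the question; lowercase_tokenizer is just an illustrative name):

from functools import partial
from nltk import casual_tokenize

# partial objects pickle cleanly: they only store a reference to the named
# function casual_tokenize plus the bound keyword argument.
lowercase_tokenizer = partial(casual_tokenize, preserve_case=False)

parameters = {'vect': [CountVectorizer(),
                       CountVectorizer(tokenizer=lowercase_tokenizer)]}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=5)
gs_clf = gs_clf.fit(twenty_train.data[:100], twenty_train.target[:100])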

2) Is there a better, pythonic way of dealing with passing parameters to a function that will itself be a parameter?

I think both lambda and partial are "pythonic" ways of doing it.

The problem here is that GridSearchCV uses multiprocessing: it may start multiple worker processes, so it has to serialize the parameters in one process and pass them to the others (and the target processes then deserialize them to get the same parameters back).

GridSearchCV uses joblib for multiprocessing/serialization, and joblib cannot handle lambda functions.
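
As a further illustration (not something shown in the original answer), a plain function defined at module level also works, because pickle only records its module and name; lowercase_casual_tokenize below is a hypothetical helper:

# Defined at module level (i.e. in an importable script/module), so pickle
# can find it by name in the worker processes.
def lowercase_casual_tokenize(text):
    return casual_tokenize(text, preserve_case=False)

text_clf = Pipeline([('vect', CountVectorizer(tokenizer=lowercase_casual_tokenize)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])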
