Bringing a classifier to production

2024/9/16 22:48:15

I've saved my classifier pipeline using joblib:

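import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

# build a pipeline of vectorizer + classifier, fit it, and save it to disk
vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
pac_clf = PassiveAggressiveClassifier(C=1)
vec_clf = Pipeline([('vectorizer', vec), ('pac', pac_clf)])
vec_clf.fit(X_train, y_train)
joblib.dump(vec_clf, 'class.pkl', compress=9)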

Now I'm trying to use it in a production environment:

def classify(title):
    # load classifier and predict
    classifier = joblib.load('class.pkl')

    # vectorize/transform the new title then predict
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
    X_test = vectorizer.transform(title)
    predict = classifier.predict(X_test)
    return predict

The error I'm getting is: ValueError: Vocabulary wasn't fitted or is empty! I guess I should load the vocabulary from the joblib file, but I can't get it to work.

Answer

Just replace:

  # load classifier and predict
  classifier = joblib.load('class.pkl')

  # vectorize/transform the new title then predict
  vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
  X_test = vectorizer.transform(title)
  predict = classifier.predict(X_test)
  return predict

with:

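  # load the saved pipeline that includes both the vectorizer
  # and the classifier and predict
  classifier = joblib.load('class.pkl')
  # the pipeline vectorizes raw text itself; it expects an iterable
  # of documents, so wrap the single title in a list
  predict = classifier.predict([title])
  return predict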

class.pkl contains the full pipeline, so there is no need to create a new vectorizer instance. As the error message says, you need to reuse the vectorizer that was fitted in the first place, because the feature mapping from tokens (string n-grams) to column indices is stored in the vectorizer itself. This mapping is called the "vocabulary".
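To make this concrete, here is a minimal end-to-end sketch. The toy training data, the labels, the temporary file location, and the random_state are assumptions made up for illustration; only the pipeline layout and the file name class.pkl come from the question. It fits and saves the pipeline, loads it back once, and shows that the fitted vocabulary travels with the pickled pipeline while a freshly constructed vectorizer has none.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data, standing in for the real X_train / y_train.
X_train = [
    "cheap flights to paris",
    "new python release announced",
    "hotel deals in rome",
    "rust compiler update shipped",
]
y_train = ["travel", "tech", "travel", "tech"]

# Same pipeline layout as in the question.
vec_clf = Pipeline([
    ("vectorizer", TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))),
    ("pac", PassiveAggressiveClassifier(C=1, random_state=0)),
])
vec_clf.fit(X_train, y_train)

# Save to a temporary directory (assumed location, for the demo only).
model_path = os.path.join(tempfile.mkdtemp(), "class.pkl")
joblib.dump(vec_clf, model_path, compress=9)

# Load the whole pipeline once, not on every call.
classifier = joblib.load(model_path)


def classify(title):
    """Return the predicted label for a single raw title string."""
    # The pipeline applies the fitted vectorizer before the classifier.
    return classifier.predict([title])[0]


# The fitted token -> column index mapping is stored inside the pipeline.
vocabulary = classifier.named_steps["vectorizer"].vocabulary_

# A brand-new vectorizer has no vocabulary, so transform() fails.
try:
    TfidfVectorizer(ngram_range=(1, 3)).transform(["cheap flights to paris"])
    fresh_vectorizer_failed = False
except ValueError:
    fresh_vectorizer_failed = True

print(classify("cheap flights to paris"))
print(len(vocabulary), fresh_vectorizer_failed)
```

Loading the pipeline at module level rather than inside classify avoids re-reading and decompressing the pickle on every request, which is usually what you want in a long-running service.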

