Question 1

I've saved my classifier pipeline using joblib:

vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))
pac_clf = PassiveAggressiveClassifier(C=1)
vec_clf = Pipeline([('vectorizer', vec), ('pac', pac_clf)])
vec_clf.fit(X_train,y_train)
joblib.dump(vec_clf, 'class.pkl', compress=9)

Now i'm trying to use it in a production env:

def classify(title):#load classifier and predictclassifier = joblib.load('class.pkl')#vectorize/transform the new title then predictvectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))X_test = vectorizer.transform(title)predict = classifier.predict(X_test)return predict

The error i'm getting is: ValueError: Vocabulary wasn't fitted or is empty! I guess i should load the Vocabulary from te joblid but i can't get it to work

Question 2

Just replace:

  #load classifier and predictclassifier = joblib.load('class.pkl')#vectorize/transform the new title then predictvectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 3))X_test = vectorizer.transform(title)predict = classifier.predict(X_test)return predict

by:

  # load the saved pipeline that includes both the vectorizer# and the classifier and predictclassifier = joblib.load('class.pkl')predict = classifier.predict(X_test)return predict

class.pkl includes the full pipeline, there is no need to create a new vectorizer instance. As the error message says you need to reuse the vectorizer that was trained in the first place because the feature mapping from token (string ngrams) to column index is saved in the vectorizer itself. This mapping is named the "vocabulary".

Bringing a classifier to production

Related Q&A

how to count the frequency of letters in text excluding whitespace and numbers? [duplicate]

Fastest algorithm for finding overlap between two very large lists?

Call Postgres SQL stored procedure From Django

How can I mix decorators with the @contextmanager decorator?

supervisord always returns exit status 127 at WebFaction

One dimensional Mahalanobis Distance in Python

DeprecationWarning: please use dns.resolver.Resolver.resolve()

python cannot find module when using ssh

Python sklearn installation windows

Python PSD layers?