Retrieve intermediate features from a pipeline in scikit-learn (Python)

2024/10/9 16:30:51

I am using a pipeline very similar to the one given in this example:

>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])

over which I use GridSearchCV to find the best estimators over a parameter grid.

However, I would like to get the column names (feature names) of my training set using CountVectorizer's get_feature_names() method. Is this possible without instantiating CountVectorizer() outside the pipeline?

Answer

Using the get_params() method, you can access the individual steps of the pipeline and their respective parameters. Here's an example of accessing 'vect':

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
print(text_clf.get_params()['vect'])

yields (for me)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

I haven't fitted the pipeline to any data in this example, so calling get_feature_names() on the vectorizer at this point will raise an error, because its vocabulary has not been learned yet.

