Creating a TfidfVectorizer over a text column of huge pandas dataframe

2024/9/8 8:53:23

I need to build a matrix of TF-IDF features from the text stored in a column of a huge dataframe, loaded from a CSV file (which cannot fit in memory). I am trying to iterate over the dataframe in chunks, but this produces generator objects, which is not an expected input type for TfidfVectorizer. I guess I am doing something wrong in the generator method ChunkIterator shown below.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Will work only for a small dataset
csvfilename = 'data_elements.csv'
df = pd.read_csv(csvfilename)
vectorizer = TfidfVectorizer()
corpus = df['text_column'].values
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

# Trying to use a generator to iterate over a huge dataframe
def ChunkIterator(filename):
    for chunk in pd.read_csv(csvfilename, chunksize=1):
        yield chunk['text_column'].values

corpus = ChunkIterator(csvfilename)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

Can anybody please advise how to modify the ChunkIterator method above, or suggest any other approach using the dataframe? I would like to avoid creating separate text files for each row in the dataframe. Below is some dummy CSV data for recreating the scenario.

id,text_column,tags
001, This is the first document .,['sports','entertainment']
002, This document is the second document .,"['politics', 'asia']"
003, And this is the third one .,['europe','nato']
004, Is this the first document ?,"['sports', 'soccer']"
Answer

The method accepts generators just fine. But it requires an iterable of raw documents, i.e. strings, whereas your generator is an iterable of numpy.ndarray objects. So try something like:

def ChunkIterator(filename):
    for chunk in pd.read_csv(filename, chunksize=1):
        for document in chunk['text_column'].values:
            yield document
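As a sanity check, here is a self-contained run of that generator (a minimal sketch: the question's dummy data is held in memory via io.StringIO instead of a file on disk, and the tags column is simplified to avoid embedded commas; neither detail changes how the generator behaves):

```python
import io
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Dummy data modeled on the question's CSV (tags simplified)
csv_data = io.StringIO(
    "id,text_column,tags\n"
    "001, This is the first document .,sports\n"
    "002, This document is the second document .,politics\n"
    "003, And this is the third one .,europe\n"
    "004, Is this the first document ?,soccer\n"
)

def ChunkIterator(filelike):
    # Read one row at a time; yield each text cell as a plain string,
    # so the vectorizer sees an iterable of raw documents.
    for chunk in pd.read_csv(filelike, chunksize=1):
        for document in chunk['text_column'].values:
            yield document

vectorizer = TfidfVectorizer()
result = vectorizer.fit_transform(ChunkIterator(csv_data))
print(result.shape)  # (4, 9): 4 documents, 9 unique terms
```

Because fit_transform consumes the generator one document at a time, only one chunk of the CSV is ever materialised as a dataframe.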

Note, I don't really understand why you are using pandas here. Just use the regular csv module, something like:

import csv
def doc_generator(filepath, textcol=0, skipheader=True):with open(filepath) as f:reader = csv.reader(f)if skipheader:next(reader, None)for row in reader:yield row[textcol]

So, in your case, pass textcol=1, for example:

In [1]: from sklearn.feature_extraction.text import TfidfVectorizer

In [2]: import csv
   ...: def doc_generator(filepath, textcol=0, skipheader=True):
   ...:     with open(filepath) as f:
   ...:         reader = csv.reader(f)
   ...:         if skipheader:
   ...:             next(reader, None)
   ...:         for row in reader:
   ...:             yield row[textcol]
   ...:

In [3]: vectorizer = TfidfVectorizer()

In [4]: result = vectorizer.fit_transform(doc_generator('testing.csv', textcol=1))

In [5]: result
Out[5]:
<4x9 sparse matrix of type '<class 'numpy.float64'>'
        with 21 stored elements in Compressed Sparse Row format>

In [6]: result.todense()
Out[6]:
matrix([[ 0.        ,  0.46979139,  0.58028582,  0.38408524,  0.        ,
          0.        ,  0.38408524,  0.        ,  0.38408524],
        [ 0.        ,  0.6876236 ,  0.        ,  0.28108867,  0.        ,
          0.53864762,  0.28108867,  0.        ,  0.28108867],
        [ 0.51184851,  0.        ,  0.        ,  0.26710379,  0.51184851,
          0.        ,  0.26710379,  0.51184851,  0.26710379],
        [ 0.        ,  0.46979139,  0.58028582,  0.38408524,  0.        ,
          0.        ,  0.38408524,  0.        ,  0.38408524]])