Creating a TfidfVectorizer over a text column of huge pandas dataframe

2024/9/8 8:53:23

I need to build a matrix of TF-IDF features from the text stored in a column of a huge dataframe, loaded from a CSV file (which cannot fit in memory). I am trying to iterate over the dataframe in chunks, but this produces generator objects, which is not an expected input type for TfidfVectorizer. I guess I am doing something wrong in the generator method ChunkIterator shown below.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Will work only for a small dataset
csvfilename = 'data_elements.csv'
df = pd.read_csv(csvfilename)
vectorizer = TfidfVectorizer()
corpus = df['text_column'].values
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

# Trying to use a generator to iterate over a huge dataframe
def ChunkIterator(filename):
    for chunk in pd.read_csv(csvfilename, chunksize=1):
        yield chunk['text_column'].values

corpus = ChunkIterator(csvfilename)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

Can anybody please advise how to modify the ChunkIterator method above, or suggest any other approach using the dataframe? I would like to avoid creating separate text files for each row in the dataframe. Below is some dummy CSV data for recreating the scenario.

id,text_column,tags
001, This is the first document .,['sports','entertainment']
002, This document is the second document .,"['politics', 'asia']"
003, And this is the third one .,['europe','nato']
004, Is this the first document ?,"['sports', 'soccer']"
Answer

The method accepts generators just fine. But it requires an iterable of raw documents, i.e. strings, whereas your generator is an iterable of numpy.ndarray objects. So try something like:

def ChunkIterator(filename):
    for chunk in pd.read_csv(filename, chunksize=1):
        for document in chunk['text_column'].values:
            yield document
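As a sanity check, here is a self-contained run of that generator (a minimal sketch: the question's dummy data is held in memory via io.StringIO instead of a file on disk, and the tags column is simplified to avoid embedded commas; neither detail changes how the generator behaves):

```python
import io
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Dummy data modeled on the question's CSV (tags simplified)
csv_data = io.StringIO(
    "id,text_column,tags\n"
    "001, This is the first document .,sports\n"
    "002, This document is the second document .,politics\n"
    "003, And this is the third one .,europe\n"
    "004, Is this the first document ?,soccer\n"
)

def ChunkIterator(filelike):
    # Read one row at a time; yield each text cell as a plain string,
    # so the vectorizer sees an iterable of raw documents.
    for chunk in pd.read_csv(filelike, chunksize=1):
        for document in chunk['text_column'].values:
            yield document

vectorizer = TfidfVectorizer()
result = vectorizer.fit_transform(ChunkIterator(csv_data))
print(result.shape)  # (4, 9): 4 documents, 9 unique terms
```

Because fit_transform consumes the generator one document at a time, only one chunk of the CSV is ever materialised as a dataframe.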

Note, I don't really understand why you are using pandas here. Just use the regular csv module, something like:

import csv
def doc_generator(filepath, textcol=0, skipheader=True):with open(filepath) as f:reader = csv.reader(f)if skipheader:next(reader, None)for row in reader:yield row[textcol]

So, in your case, pass textcol=1, for example:

In [1]: from sklearn.feature_extraction.text import TfidfVectorizer

In [2]: import csv
   ...: def doc_generator(filepath, textcol=0, skipheader=True):
   ...:     with open(filepath) as f:
   ...:         reader = csv.reader(f)
   ...:         if skipheader:
   ...:             next(reader, None)
   ...:         for row in reader:
   ...:             yield row[textcol]
   ...:

In [3]: vectorizer = TfidfVectorizer()

In [4]: result = vectorizer.fit_transform(doc_generator('testing.csv', textcol=1))

In [5]: result
Out[5]:
<4x9 sparse matrix of type '<class 'numpy.float64'>'
        with 21 stored elements in Compressed Sparse Row format>

In [6]: result.todense()
Out[6]:
matrix([[ 0.        ,  0.46979139,  0.58028582,  0.38408524,  0.        ,
          0.        ,  0.38408524,  0.        ,  0.38408524],
        [ 0.        ,  0.6876236 ,  0.        ,  0.28108867,  0.        ,
          0.53864762,  0.28108867,  0.        ,  0.28108867],
        [ 0.51184851,  0.        ,  0.        ,  0.26710379,  0.51184851,
          0.        ,  0.26710379,  0.51184851,  0.26710379],
        [ 0.        ,  0.46979139,  0.58028582,  0.38408524,  0.        ,
          0.        ,  0.38408524,  0.        ,  0.38408524]])