Vectorization: Not a valid collection

2024/10/7 22:19:21

I want to vectorize a text file containing my training corpus for the OneClassSVM classifier. For that I'm using CountVectorizer from the scikit-learn library. Here is my code:

    def file_to_corpse(file_name, stop_words):
        array_file = []
        with open(file_name) as fd:
            corp = fd.readlines()
        array_file = np.array(corp)
        stwf = stopwords.words('french')
        for w in stop_words:
            stwf.append(w)
        vectorizer = CountVectorizer(decode_error='replace', stop_words=stwf, min_df=1)
        X = vectorizer.fit_transform(array_file)
        return X

When I run my function on my file (around 206346 lines), I get the following error, and I can't understand it:

    Traceback (most recent call last):
      File "", line 93, in <module>
      File "/home/imane/anaconda/lib/python2.7/site-packages/sklearn/svm/", line 1028, in fit
        super(OneClassSVM, self).fit(X, np.ones(_num_samples(X)), sample_weight=sample_weight,
      File "/home/imane/anaconda/lib/python2.7/site-packages/sklearn/utils/", line 122, in _num_samples
        " a valid collection." % x)
    TypeError: Singleton array array(<536172x13800 sparse matrix of type '<type 'numpy.int64'>'
        with 1952637 stored elements in Compressed Sparse Row format>, dtype=object) cannot be considered a valid collection.

Can somebody please help me solve this problem? I've been stuck for a while :).


If you look at the scikit-learn source for _num_samples (the function at the bottom of your traceback), you will see that it checks whether the following condition is true, with x being your array:

if len(x.shape) == 0:

If so, it raises this exception:

TypeError("Singleton array %r cannot be considered a valid collection." % x)

I suggest checking whether array_file, or the value returned by your function, has a non-empty shape, that is, len(x.shape) > 0.
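As a rough illustration of that check, here is a minimal sketch that assumes only NumPy is installed. OpaqueMatrix and is_singleton are hypothetical names made up for this example; they are not part of scikit-learn. The sketch shows that an array built from a list of lines has a non-empty shape, while an object NumPy cannot unpack ends up wrapped in a 0-d array with dtype=object, which looks like the situation the error message describes.

```python
import numpy as np


class OpaqueMatrix(object):
    """Hypothetical stand-in for an object NumPy cannot split into elements."""


def is_singleton(x):
    # Same condition as the one quoted from the source: an empty shape tuple.
    shape = getattr(x, "shape", None)
    return shape is not None and len(shape) == 0


# An array built from a list of text lines is 1-d, so its shape is non-empty.
lines = np.array(["bonjour le monde\n", "au revoir\n"])

# Wrapping a single opaque object gives a 0-d array of dtype=object.
wrapped = np.array(OpaqueMatrix())

print(lines.shape, is_singleton(lines))
print(wrapped.shape, wrapped.dtype, is_singleton(wrapped))
```

Printing the shape of array_file and of the value you pass to fit in the same way should show which of the two ends up with an empty shape.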

