How to fix a NaN or infinity issue for a sparse matrix in Python?

2024/10/7 4:23:47

I'm totally new to Python. I found some code online and have been trying to adapt it. I'm creating a term-document matrix and I want to add some extra features before training a logistic regression model.

I've checked my data in R and found no problems, but when I run the logistic regression I get the error "ValueError: Array contains NaN or infinity." I don't get this error when I leave out my own features. My features are in the file "toPython.txt".

Note the two calls to assert_all_finite in the code: both return "None", meaning the check finished without raising!

Below is the code I use and the output I get:

import numpy as np
import pandas as p
from scipy import sparse
from scipy.sparse import csr_matrix, hstack
from sklearn import cross_validation
from sklearn import linear_model as lm
from sklearn.feature_extraction.text import TfidfVectorizer


def _assert_all_finite(X):
    # Cheap check on the sum first; every element is scanned only if the sum is not finite.
    if (X.dtype.char in np.typecodes['AllFloat']
            and not np.isfinite(X.sum())
            and not np.isfinite(X).all()):
        raise ValueError("Array contains NaN or infinity.")


def assert_all_finite(X):
    # For a sparse matrix only the stored values (X.data) are checked.
    _assert_all_finite(X.data if sparse.issparse(X) else X)


def main():
    print "loading data.."
    traindata = list(np.array(p.read_table('data/train.tsv'))[:,2])
    testdata = list(np.array(p.read_table('data/test.tsv'))[:,2])
    y = np.array(p.read_table('data/train.tsv'))[:,-1]

    tfv = TfidfVectorizer(min_df=12, max_features=None, strip_accents='unicode',
                          analyzer='word', stop_words='english', lowercase=True,
                          token_pattern=r'\w{1,}', ngram_range=(1, 1),
                          use_idf=1, smooth_idf=1, sublinear_tf=1)
    rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1,
                               fit_intercept=True, intercept_scaling=1.0,
                               class_weight=None, random_state=None)

    X_all = traindata + testdata
    lentrain = len(traindata)

    # Extra features: store only the entries that are not NaN.
    f = np.array(p.read_table('data/toPython.txt'))
    indices = np.nonzero(~np.isnan(f))
    b = csr_matrix((f[indices], indices), shape=f.shape, dtype='float')
    print b.get_shape
    print assert_all_finite(b)  # first of the two calls mentioned above

    print "fitting pipeline"
    tfv.fit(X_all)
    print "transforming data"
    X_all = tfv.transform(X_all)
    print X_all.get_shape

    # Append the extra features as additional columns.
    X_all = hstack([X_all, b], format='csr')
    print X_all.get_shape
    print assert_all_finite(X_all)  # second call

    X = X_all[:lentrain]
    print "3 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=3, scoring='roc_auc'))

And the output is:

loading data..
<bound method csr_matrix.get_shape of <10566x40 sparse matrix of type '<type 'numpy.float64'>'
with 422640 stored elements in Compressed Sparse Row format>>
fitting pipeline
transforming data
<bound method csr_matrix.get_shape of <10566x13913 sparse matrix of type '<type 'numpy.float64'>'
with 1450834 stored elements in Compressed Sparse Row format>>
<bound method csr_matrix.get_shape of <10566x13953 sparse matrix of type '<type 'numpy.float64'>'
with 1873474 stored elements in Compressed Sparse Row format>>
3 Fold CV Score: 
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\", line 523, in runfile
execfile(filename, namespace)
File "C:\Users\Stergios\Documents\Python\", line 100, in <module>
File "C:\Users\Stergios\Documents\Python\", line 97, in main
print "3 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=3, scoring='roc_auc'))
File "C:\Python27\lib\site-packages\sklearn\", line 1152, in cross_val_score
for train, test in cv)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\", line 517, in __call__
self.dispatch(function, args, kwargs)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\", line 312, in dispatch
job = ImmediateApply(func, args, kwargs)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\", line 136, in __init__
self.results = func(*args, **kwargs)
File "C:\Python27\lib\site-packages\sklearn\", line 1064, in _cross_val_score
score = scorer(estimator, X_test, y_test)
File "C:\Python27\lib\site-packages\sklearn\metrics\", line 141, in __call__
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "C:\Python27\lib\site-packages\sklearn\metrics\", line 403, in roc_auc_score
fpr, tpr, tresholds = roc_curve(y_true, y_score)
File "C:\Python27\lib\site-packages\sklearn\metrics\", line 672, in roc_curve
fps, tps, thresholds = _binary_clf_curve(y_true, y_score, pos_label)
File "C:\Python27\lib\site-packages\sklearn\metrics\", line 504, in _binary_clf_curve
y_true, y_score = check_arrays(y_true, y_score)
File "C:\Python27\lib\site-packages\sklearn\utils\", line 233, in check_arrays
File "C:\Python27\lib\site-packages\sklearn\utils\", line 27, in _assert_all_finite
raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

Any ideas? Thank you!!


I found that the following worked quite nicely, assuming sm is a sparse matrix (mine was a CSR matrix; please say something about other types if you know!):

Manually replacing NaNs with appropriate numbers in the data vector:

In [4]: np.isnan(sm.data).any()
Out[4]: True

In [5]: sm.data.shape
Out[5]: (553555,)

In [6]: sm.data = np.nan_to_num(sm.data)

In [7]: np.isnan(sm.data).any()
Out[7]: False

In [8]: sm.data.shape
Out[8]: (553555,)

So we no longer have NaN values, but the matrix still stores the resulting zeros explicitly, as if they were ordinary nonzero entries.
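
For anyone who wants to try the replacement step on something small, here is a minimal sketch. The matrix, its values, and the variable names other than sm are made up for illustration; the point is only that np.nan_to_num rewrites the stored values in sm.data and leaves the number of stored entries alone.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2x3 matrix with three stored values, one of them NaN.
values = np.array([1.0, np.nan, 2.0])
rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])
sm = csr_matrix((values, (rows, cols)), shape=(2, 3))

had_nan = bool(np.isnan(sm.data).any())
stored_before = sm.data.shape[0]

# Rewrite only the stored values; the index arrays are left alone.
sm.data = np.nan_to_num(sm.data)

has_nan = bool(np.isnan(sm.data).any())
stored_after = sm.data.shape[0]

print(had_nan, has_nan, stored_before, stored_after)
```

Keep in mind that np.nan_to_num also turns positive and negative infinity into very large finite numbers by default, which may or may not be what you want for model features.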

Removing explicitly encoded zero values from sparse matrix:

In [9]: sm.eliminate_zeros()

In [10]: sm.data.shape
Out[10]: (551391,)
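
The same step on a toy matrix, as a rough sketch (the values are hypothetical and stand in for a matrix that already went through np.nan_to_num). eliminate_zeros works in place and returns None, so the result is read from the matrix itself.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical matrix whose stored values include one explicit zero,
# like the one left behind after a NaN was replaced.
values = np.array([1.0, 0.0, 2.0])
rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])
sm = csr_matrix((values, (rows, cols)), shape=(2, 3))

dense_before = sm.toarray()
stored_before = sm.nnz   # counts stored entries, explicit zeros included

sm.eliminate_zeros()     # in place; drops stored entries equal to zero

stored_after = sm.nnz
dense_after = sm.toarray()

print(stored_before, stored_after)
```

The dense view of the matrix is identical before and after; only the storage shrinks.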

And our matrix actually got smaller now, yay!
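
On the question of other sparse types: I have only done this on CSR myself, so take the following as a sketch under assumptions rather than a tested recipe for every format. One way to sidestep format differences is to convert to CSR first and clean the copy. The helper name clean_sparse is hypothetical (it is not part of SciPy or scikit-learn), and setting infinities to zero is my own choice here, not what np.nan_to_num does by default.

```python
import numpy as np
from scipy import sparse


def clean_sparse(matrix):
    """Return a CSR copy with non-finite stored values zeroed and then dropped.

    Hypothetical helper for illustration only.
    """
    # Converting to CSR gives one code path for COO, CSC, LIL, DOK, etc.
    cleaned = sparse.csr_matrix(matrix, copy=True)
    # Assumption: NaN, +inf and -inf are all treated as missing and set to 0.
    cleaned.data[~np.isfinite(cleaned.data)] = 0.0
    # Drop the explicit zeros so they no longer take up storage.
    cleaned.eliminate_zeros()
    return cleaned


# Usage with a made-up COO matrix containing one NaN and one infinity.
coo = sparse.coo_matrix(
    (np.array([1.0, np.nan, np.inf, 2.0]),
     (np.array([0, 0, 1, 1]), np.array([0, 2, 0, 1]))),
    shape=(2, 3),
)
cleaned = clean_sparse(coo)
print(cleaned.format, cleaned.nnz)
```

Because the helper works on a copy, the original matrix keeps its NaN and infinity values, which is handy if you want to compare before and after.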

