How to fix NaN or infinity issue for sparse matrix in python?

2024/10/7 4:23:47

I'm totally new to python. I've used some code found online and I tried to work on it. So I'm creating a text-document-matrix and I want to add some extra features before training a logistic regression model.

Although I've checked my data with R and I get no error, when I run the logistic regression I get the error "ValueError: Array contains NaN or infinity." I'm not getting the same error when I do not add my own features. My features are in the file "toPython.txt".

Mind the two calls to assert_all_finite function that returns "None"!

Below is the code I use and the output I get:

def _assert_all_finite(X):
if X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum()) and not np.isfinite(X).all():raise ValueError("Array contains NaN or infinity.")def assert_all_finite(X):
_assert_all_finite(X.data if sparse.issparse(X) else X)def main():print "loading data.."
traindata = list(np.array(p.read_table('data/train.tsv'))[:,2])
testdata = list(np.array(p.read_table('data/test.tsv'))[:,2])
y = np.array(p.read_table('data/train.tsv'))[:,-1]tfv = TfidfVectorizer(min_df=12,  max_features=None, strip_accents='unicode',  analyzer='word',stop_words='english', lowercase=True,token_pattern=r'\w{1,}',ngram_range=(1, 1), use_idf=1,smooth_idf=1,sublinear_tf=1)rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, C=1, fit_intercept=True, intercept_scaling=1.0, class_weight=None, random_state=None)X_all = traindata + testdata
lentrain = len(traindata)f = np.array(p.read_table('data/toPython.txt'))
indices = np.nonzero(~np.isnan(f))
b = csr_matrix((f[indices], indices), shape=f.shape, dtype='float')print b.get_shape
**print assert_all_finite(b)**
print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)
print X_all.get_shapeX_all=hstack( [X_all,b], format='csr' )
print X_all.get_shape**print assert_all_finite(X_all)**X = X_all[:lentrain]
print "3 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=3, scoring='roc_auc'))

And the output is:

loading data..
<bound method csr_matrix.get_shape of <10566x40 sparse matrix of type '<type 'numpy.float64'>'
with 422640 stored elements in Compressed Sparse Row format>>
**None**
fitting pipeline
transforming data
<bound method csr_matrix.get_shape of <10566x13913 sparse matrix of type '<type 'numpy.float64'>'
with 1450834 stored elements in Compressed Sparse Row format>>
<bound method csr_matrix.get_shape of <10566x13953 sparse matrix of type '<type 'numpy.float64'>'
with 1873474 stored elements in Compressed Sparse Row format>>
**None**
3 Fold CV Score: 
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 523, in runfile
execfile(filename, namespace)
File "C:\Users\Stergios\Documents\Python\beat_bench.py", line 100, in <module>
main()
File "C:\Users\Stergios\Documents\Python\beat_bench.py", line 97, in main
print "3 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=3, scoring='roc_auc'))
File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1152, in cross_val_score
for train, test in cv)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 517, in __call__
self.dispatch(function, args, kwargs)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 312, in dispatch
job = ImmediateApply(func, args, kwargs)
File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 136, in __init__
self.results = func(*args, **kwargs)
File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1064, in _cross_val_score
score = scorer(estimator, X_test, y_test)
File "C:\Python27\lib\site-packages\sklearn\metrics\scorer.py", line 141, in __call__
return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 403, in roc_auc_score
fpr, tpr, tresholds = roc_curve(y_true, y_score)
File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 672, in roc_curve
fps, tps, thresholds = _binary_clf_curve(y_true, y_score, pos_label)
File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 504, in _binary_clf_curve
y_true, y_score = check_arrays(y_true, y_score)
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 233, in check_arrays
_assert_all_finite(array)
File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 27, in _assert_all_finite
raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

Any ideas? Thank you!!

Answer

I found that doing the following, assuming sm is a sparse matrix (mine was CSR matrix, please say something about other types if you know!) worked quite nicely:

Manually replacing nans with appropriate numbers in data vector:

In [4]: np.isnan(matrix.data).any()
Out[4]: TrueIn [5]: sm.data.shape
Out[5]: (553555,)In [6]: sm.data = np.nan_to_num(sm.data)In [7]: np.isnan(matrix.data).any()
Out[7]: FalseIn [8]: sm.data.shape
Out[8]: (553555,)

So we no longer have nan values, but matrix explicitly encodes those zeros as valued indices.

Removing explicitly encoded zero values from sparse matrix:

In [9]: sm.eliminate_zeros()In [10]: sm.data.shape
Out[10]: (551391,)

And our matrix actually got smaller now, yay!

https://en.xdnf.cn/q/70285.html

Related Q&A

Mutable default argument for a Python namedtuple

I came across a neat way of having namedtuples use default arguments from here.from collections import namedtuple Node = namedtuple(Node, val left right) Node.__new__.__defaults__ = (None, None, None) …

Windows notification with button using python

I need to make a program that alerts me with a windows notification, and I found out that this can be simply done with the following code. I dont care what library I use from win10toast import ToastNo…

numpy IndexError: too many indices for array when indexing matrix with another

I have a matrix a which I create like this:>>> a = np.matrix("1 2 3; 4 5 6; 7 8 9; 10 11 12")I have a matrix labels which I create like this:>>> labels = np.matrix("1;0…

how to use enum in swig with python?

I have a enum declaration as follows:typedef enum mail_ {Out = 0,Int = 1,Spam = 2 } mail;Function:mail status; int fill_mail_data(int i, &status);In the function above, status gets filled up and wi…

What does except Exception as e mean in python? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.Want to improve this question? Update the question so it focuses on one problem only by editing this post.Closed 4…

Numpy efficient big matrix multiplication

To store big matrix on disk I use numpy.memmap.Here is a sample code to test big matrix multiplication:import numpy as np import timerows= 10000 # it can be large for example 1kk cols= 1000#create some…

Basic parallel python program freezes on Windows

This is the basic Python example from https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool on parallel processingfrom multiprocessing import Pooldef f(x):return x*xif __na…

Formatting multiple worksheets using xlsxwriter

How to copy the same formatting to different sheets of the same Excel file using the xlsxwriter library in Python?The code I tried is: import xlsxwriterimport pandas as pd import numpy as npfrom xlsxw…

Good python library for generating audio files? [closed]

Closed. This question is seeking recommendations for books, tools, software libraries, and more. It does not meet Stack Overflow guidelines. It is not currently accepting answers.We don’t allow questi…

Keras ConvLSTM2D: ValueError on output layer

I am trying to train a 2D convolutional LSTM to make categorical predictions based on video data. However, my output layer seems to be running into a problem:"ValueError: Error when checking targe…