scipy cdist with sparse matrices

2024/11/3 7:42:34

I need to calculate the distances between two sets of vectors, source_matrix and target_matrix.

I have the following line, when both source_matrix and target_matrix are of type scipy.sparse.csr.csr_matrix:

distances = sp.spatial.distance.cdist(source_matrix, target_matrix)

And I end up getting the following partial exception traceback:

  File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 2060, in cdist
    [XA] = _copy_arrays_if_base_present([_convert_to_double(XA)])
  File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 146, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.

This seems to indicate that the sparse matrices are being treated as dense NumPy arrays, which both fails and defeats the purpose of using sparse matrices.

Any advice?

Answer

I appreciate this post is quite old, but as one of the comments suggested, you could use the sklearn implementation which accepts sparse vectors and matrices.

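Take two random vectors, for example:

import scipy.sparse
import sklearn.metrics

# Two random 1 x 100 sparse row vectors in CSR format
a = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
b = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
# example output (values vary because the input is random):
# array([[ 3.14837228]])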

Or even if a is a matrix and b is a vector:

a = scipy.sparse.rand(m=500,n=100,density=0.2,format='csr')
b = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
# example output (values vary because the input is random):
# array([[ 2.9864606 ],
#        [ 3.33862248],
#        [ 3.45803465],
#        [ 3.15453179],
#        ...

scipy.spatial.distance does not support sparse matrices, so sklearn is the better choice here. You can also pass the n_jobs argument to sklearn.metrics.pairwise.pairwise_distances, which splits the computation across multiple CPU cores; this helps when the matrices are very large.
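As a rough sanity check, the sketch below compares the sklearn result on sparse input against cdist on the same data converted to dense arrays. The shapes, density, random seeds, and n_jobs=2 are arbitrary choices for illustration, not values from the question, and densifying with .toarray() is only viable when the data fits in memory.

```python
import numpy as np
import scipy.sparse
from scipy.spatial.distance import cdist
from sklearn.metrics import pairwise_distances

# Small random sparse inputs (sizes and seeds are illustrative assumptions).
source_matrix = scipy.sparse.random(50, 100, density=0.2, format='csr', random_state=0)
target_matrix = scipy.sparse.random(30, 100, density=0.2, format='csr', random_state=1)

# sklearn accepts CSR input directly for the Euclidean metric.
sparse_result = pairwise_distances(source_matrix, target_matrix, metric='euclidean')

# cdist needs dense arrays, so convert explicitly before calling it.
dense_result = cdist(source_matrix.toarray(), target_matrix.toarray(), metric='euclidean')

# Same computation, split across two workers via n_jobs.
parallel_result = pairwise_distances(source_matrix, target_matrix,
                                     metric='euclidean', n_jobs=2)

print(sparse_result.shape)                          # one row per source vector
print(np.allclose(sparse_result, dense_result))     # both routes should agree
print(np.allclose(sparse_result, parallel_result))  # n_jobs should not change values
```

Note that the returned distance matrix is a dense NumPy array of shape (n_source, n_target), so only the inputs stay sparse.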

Hope that helps
