I need to calculate the distances between two sets of vectors, source_matrix and target_matrix.
I have the following line, where both source_matrix and target_matrix are of type scipy.sparse.csr.csr_matrix:
distances = sp.spatial.distance.cdist(source_matrix, target_matrix)
And I end up getting the following partial exception traceback:
File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 2060, in cdist
    [XA] = _copy_arrays_if_base_present([_convert_to_double(XA)])
File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 146, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.
This seems to indicate that the sparse matrices are being treated as dense NumPy arrays, which both fails and defeats the purpose of using sparse matrices.
Any advice?
I appreciate this post is quite old, but as one of the comments suggested, you could use the sklearn implementation which accepts sparse vectors and matrices.
Take two random vectors, for example:
import scipy.sparse
import sklearn.metrics.pairwise
a = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
b = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
>>> array([[ 3.14837228]]) # example output
Or even if a is a matrix and b is a vector:
a = scipy.sparse.rand(m=500,n=100,density=0.2,format='csr')
b = scipy.sparse.rand(m=1,n=100,density=0.2,format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
>>> array([[ 2.9864606 ], # example output
       [ 3.33862248],
       [ 3.45803465],
       [ 3.15453179],
       ...
SciPy's spatial.distance does not support sparse matrices, so sklearn is the better choice here. You can also pass the n_jobs argument to sklearn.metrics.pairwise.pairwise_distances, which splits the computation across parallel jobs and helps when your matrices are very large.
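To tie this back to the original question, here is a minimal sketch of the two-matrix case with n_jobs set. The matrix sizes, the random seeds and n_jobs=2 are my own assumptions for illustration, not values from the question. The comparison against cdist on dense copies is only a sanity check and is only practical when the inputs are small enough to densify.

```python
import numpy as np
import scipy.sparse
from scipy.spatial.distance import cdist
from sklearn.metrics import pairwise_distances

# Hypothetical inputs: 500 source vectors and 50 target vectors, 100 features each.
source_matrix = scipy.sparse.random(500, 100, density=0.2, format='csr', random_state=0)
target_matrix = scipy.sparse.random(50, 100, density=0.2, format='csr', random_state=1)

# Sparse CSR inputs are passed directly; n_jobs=2 splits the work into two parallel jobs.
distances = pairwise_distances(source_matrix, target_matrix,
                               metric='euclidean', n_jobs=2)

# Sanity check: cdist needs dense arrays, so convert explicitly (small inputs only).
dense_distances = cdist(source_matrix.toarray(), target_matrix.toarray())

print(distances.shape)  # (500, 50): one row per source vector, one column per target vector
print(np.allclose(distances, dense_distances))
```

The result is a dense array with one row per source vector and one column per target vector, so memory for the output still grows with the product of the two row counts even though the inputs stay sparse.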
Hope that helps