Fastest way to compute the distance between each pair of points in Python


In my project I need to compute the Euclidean distance between each pair of points stored in an array. The input array is a 2D NumPy array with 3 columns, which are the coordinates (x, y, z); each row defines a new point.

I usually work with 5000 - 6000 points in my test cases.

My first algorithm uses Cython and my second NumPy. I find that my NumPy algorithm is faster than the Cython one.

Edit: with 6000 points:

NumPy: 1.76 s / Cython: 4.36 s

Here's my Cython code:

cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void calcul1(double[::1] M, double[::1] R):
    cdef int i = 0
    cdef int max = M.shape[0]
    cdef int x, y
    cdef int start = 3          # offset of the first point after point 0
    for x in range(0, max, 3):  # x is the offset of the current point
        for y in range(start, max, 3):
            R[i] = sqrt((M[y] - M[x])**2 + (M[y+1] - M[x+1])**2 + (M[y+2] - M[x+2])**2)
            i += 1
        start += 3              # each point occupies 3 doubles, so step by 3

M is a memoryview of the input array, flattened with NumPy's flatten() before calcul1() is called; R is a memoryview of a 1D output array that stores all the results.
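
For reference (a small illustration, not from the original post), flattening an (n, 3) array in C order puts point p's coordinates at offsets 3p, 3p+1 and 3p+2, which is the layout calcul1() indexes into:

import numpy as np

M = np.array([[0.0, 1.0, 2.0],   # point 0
              [3.0, 4.0, 5.0]])  # point 1
Mf = M.flatten()                 # C order: [0., 1., 2., 3., 4., 5.]
print(Mf[3], Mf[4], Mf[5])       # 3.0 4.0 5.0 -> coordinates of point 1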

Here's my Numpy code :

import numpy as np

def calcul2(M):
    return np.sqrt(((M[:, :, np.newaxis] - M[:, np.newaxis, :]) ** 2).sum(axis=0))

Here M is the input array, but transposed with NumPy's transpose() before the call so that the coordinates (x, y, z) are rows and the points are columns.

Moreover, this NumPy function is quite convenient because the array it returns is well organized. It's an n-by-n array, with n the number of points, and each point has a row and a column. So, for example, the distance AB is stored at the intersection of row A and column B.
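
For instance (a minimal sketch of that layout, not from the original post), with three points A, B, C the result is a 3×3 symmetric matrix:

import numpy as np

def calcul2(M):
    return np.sqrt(((M[:, :, np.newaxis] - M[:, np.newaxis, :]) ** 2).sum(axis=0))

# A = (0,0,0), B = (1,0,0), C = (0,2,0), stored with coordinates as rows
Mt = np.array([[0.0, 1.0, 0.0],   # x of A, B, C
               [0.0, 0.0, 2.0],   # y of A, B, C
               [0.0, 0.0, 0.0]])  # z of A, B, C
D = calcul2(Mt)
print(D[0, 1], D[0, 2])  # 1.0 2.0 -> distances AB and AC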

Here's how I call them (from a Cython function):

import numpy as np

cpdef test():
    cdef double[::1] Mf
    cdef double[::1] out = np.empty(17997000, dtype=np.float64)  # (6000² - 6000) / 2
    M = np.arange(6000*3, dtype=np.float64).reshape(6000, 3)  # example array with 6000 points
    Mf = M.flatten()    # my Cython algorithm needs a 1D array
    Mt = M.transpose()  # my NumPy algorithm needs the coordinates as rows
    calcul2(Mt)
    calcul1(Mf, out)

Am I doing something wrong here? Neither of them is fast enough for my project.

1: Is there a way to improve my Cython code so that it beats NumPy's speed?

2: Is there a way to make my NumPy code compute even faster?

3: Or is there any other solution? It must be Python/Cython, though (parallel computing, for example).

Thank you.

Answer

Not sure where you are getting your timings, but you can use scipy.spatial.distance:

import numpy as np
import scipy.spatial.distance as sd

M = np.arange(6000*3, dtype=np.float64).reshape(6000, 3)
np_result = calcul2(M)
sp_result = sd.cdist(M.T, M.T)  # SciPy usage
np.allclose(np_result, sp_result)
>>> True

Timings:

%timeit calcul2(M)
1000 loops, best of 3: 313 µs per loop

%timeit sd.cdist(M.T, M.T)
10000 loops, best of 3: 86.4 µs per loop

Importantly, it's also useful to realize that your output is symmetric:

np.allclose(sp_result, sp_result.T)
>>> True

An alternative is to compute only the upper triangle of this array:

%timeit sd.pdist(M.T)
10000 loops, best of 3: 39.1 µs per loop
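
pdist returns a condensed 1D array of length n*(n-1)/2. If you still want the n-by-n lookup described in the question (distance AB at row A, column B), scipy.spatial.distance.squareform can expand it; a short sketch, at the cost of materializing the full symmetric matrix:

import numpy as np
import scipy.spatial.distance as sd

M = np.random.rand(6000, 3)        # 6000 points, one per row
condensed = sd.pdist(M)            # shape (n*(n-1)//2,): upper triangle only
square = sd.squareform(condensed)  # full symmetric (n, n) matrix, zero diagonal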

Edit: Not sure which axis you want to treat as the points; it looks like you may be doing it both ways? Pairing along the other axis for comparison:

%timeit sd.pdist(M)
10 loops, best of 3: 135 ms per loop

Still about 10x faster than your current NumPy implementation.
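
If you keep the condensed pdist output to save memory, the distance between points i and j (with i < j out of n points) sits at a fixed offset; a small sketch of the standard mapping (the helper condensed_index is mine, not part of SciPy):

import numpy as np
import scipy.spatial.distance as sd

def condensed_index(n, i, j):
    # position of the (i, j) pair, i < j, in pdist's condensed output
    return n * i - i * (i + 1) // 2 + (j - i - 1)

M = np.arange(15, dtype=np.float64).reshape(5, 3)  # 5 example points
d = sd.pdist(M)
i, j = 1, 3
print(np.isclose(d[condensed_index(5, i, j)], np.linalg.norm(M[i] - M[j])))  # True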

https://en.xdnf.cn/q/71460.html
