Why is vectorized numpy code slower than for loops?


I have two numpy arrays, X and Y, with shapes (n,d) and (m,d), respectively. Assume that we want to compute the Euclidean distances between each row of X and each row of Y and store the result in array Z with shape (n,m). I have two implementations for this. The first implementation uses two for loops as follows:

for i in range(n):
    for j in range(m):
        Z[i,j] = np.sqrt(np.sum(np.square(X[i] - Y[j])))

The second implementation uses only one loop by vectorization:

for i in range(n):
    Z[i] = np.sqrt(np.sum(np.square(X[i] - Y), axis=1))

When I run this code on particular X and Y data, the first implementation takes nearly 30 seconds while the second takes nearly 60 seconds. I expect the second implementation to be faster since it uses vectorization. What is the reason for its slowness? I know that we can obtain a faster implementation by fully vectorizing the code, but I don't understand why the second version (which is partially vectorized) is slower than the non-vectorized one.

Here is the complete code:

import time
import numpy as np

n,m,d = 5000,500,3000
X = np.random.rand(n,d)
Y = np.random.rand(m,d)
Z = np.zeros((n,m))

tic = time.time()
for i in range(n):
    for j in range(m):
        Z[i,j] = np.sqrt(np.sum(np.square(X[i] - Y[j])))
print('Elapsed time 1: ', time.time()-tic)

tic = time.time()
for i in range(n):
    Z[i] = np.sqrt(np.sum(np.square(X[i]-Y), axis=1))
print('Elapsed time 2: ', time.time()-tic)

tic = time.time()
train_squared = np.square(X).sum(axis=1).reshape((1,n))
test_squared = np.square(Y).sum(axis=1).reshape((m,1))
test_train = -2*np.matmul(Y, X.T)
dists = np.sqrt(test_train + train_squared + test_squared)
print('Elapsed time 3: ', time.time()-tic)

And this is the output:

Elapsed time 1:  35.659096002578735
Elapsed time 2:  65.57051086425781
Elapsed time 3:  0.3912069797515869
Answer

I pulled apart your expressions and reduced them down to this MVCE:

# first test: copies one row of Y at a time (mimics the inner subtraction of the two-loop version)
for i in range(n):
    for j in range(m):
        Y[j].copy()

# second test: copies the whole of Y each time (mimics the one-loop, vectorized version)
for i in range(n):
    Y.copy()

The copy() here is just to simulate the subtraction from X. The subtraction itself should be quite cheap.
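If you want to reproduce that comparison yourself, a self-contained harness in the same style as the question's timing code looks roughly like this (a sketch; the absolute numbers depend heavily on your CPU and its cache sizes):

import time
import numpy as np

n, m, d = 5000, 500, 3000
Y = np.random.rand(m, d)

tic = time.time()
for i in range(n):
    for j in range(m):
        Y[j].copy()              # one 3000-element row at a time (~24 KB)
print('row copies:  ', time.time() - tic)

tic = time.time()
for i in range(n):
    Y.copy()                     # the whole 500x3000 array each time (~12 MB)
print('whole copies:', time.time() - tic)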

Here are the results on my computer:

  • The first one took 10ms.
  • The second one took 13s!

Both loops copy exactly the same amount of data. With your choices n=5000, m=500, d=3000, each of them copies 60 gigabytes.

To be honest, I'm not surprised at all by the 13 seconds. That's already over 4 GB/s, essentially the maximum bandwidth between my CPU and RAM (roughly what memcpy achieves).

The really surprising thing is that the first test managed to copy 60 GB in only 0.01 seconds, which translates to 6 TB/s!
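The arithmetic behind those figures is just the total bytes moved divided by the elapsed time; a quick back-of-the-envelope check, not a benchmark:

n, m, d = 5000, 500, 3000
bytes_moved = n * m * d * 8           # 8-byte float64 values -> 60e9 bytes = 60 GB in either loop
print(bytes_moved / 13 / 1e9)         # ~4.6 GB/s for the 13 s run
print(bytes_moved / 0.01 / 1e12)      # ~6 TB/s for the 10 ms run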

I'm pretty sure this is because the data isn't actually leaving the CPU at all. It's just bouncing back and forth between the CPU and the L1 cache: an array of 3000 double-precision numbers (about 24 KB) easily fits in a 32 KiB L1 cache.

Therefore, I deduce that the main reason your second algorithm isn't as great as one would naively expect is that processing a whole chunk of 500×3000 elements (about 12 MB) per iteration is very unfriendly to the CPU cache: you basically evict the whole cache into RAM! In contrast, your first algorithm does take advantage of the cache to some extent, because the 3000 elements are still in cache by the time the sum gets computed, so there's not nearly as much data moving between your CPU and RAM. (Once you have the sum, the 3000-element array is "thrown away", which means it will probably just get overwritten in cache and never make it back to the actual RAM.)
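One way to see the cache effect in action is to keep the single-loop style but walk over Y in blocks small enough for the block and its temporaries to stay in cache. This is only a sketch, reusing X, Y, Z, n and m from the question's code; the block size of 64 rows (64 × 3000 × 8 ≈ 1.5 MB) is an assumption to tune, not a magic number, and I haven't benchmarked it here:

B = 64  # rows of Y per block; 64 * 3000 * 8 bytes ≈ 1.5 MB, assumed to fit in cache
for i in range(n):
    for j0 in range(0, m, B):
        Yb = Y[j0:j0 + B]
        # the temporaries (X[i] - Yb and its square) are now small, so they stay in cache
        Z[i, j0:j0 + B] = np.sqrt(np.square(X[i] - Yb).sum(axis=1))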


Naturally, doing matrix multiplication is insanely faster, because your problem is essentially of the form:

C[i, j] = ∑[k] f(A[i, k], B[j, k])

If you replace f(x, y) with x * y, you can see it's just a variant of matrix multiplication. The operation f is not terribly important here; what is important is how the indices behave in this equation, because that determines the memory access pattern over your arrays. The essence of matrix multiplication algorithms lies in their ability to cope with this kind of array access through blocking, so in principle the overall algorithm does not change dramatically even for a user-defined f. Unfortunately, in practice very few libraries support user-defined operations, so you have to use the trick (X - Y)**2 = X**2 - 2 X Y + Y**2, as you have done. But it gets the job done :D
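For reference, here is a minimal sketch of that expansion laid out so the result has shape (n, m) like Z in the loop versions (the question's third snippet produces the (m, n) transpose); the np.maximum clamp is an extra safeguard I've added against tiny negative values from round-off before the square root:

sq_X = np.square(X).sum(axis=1)[:, None]   # shape (n, 1)
sq_Y = np.square(Y).sum(axis=1)[None, :]   # shape (1, m)
cross = X @ Y.T                            # shape (n, m), a single large matrix multiply
Z_fast = np.sqrt(np.maximum(sq_X + sq_Y - 2.0 * cross, 0.0))

If SciPy is available, scipy.spatial.distance.cdist(X, Y) computes the same (n, m) Euclidean distance matrix without spelling out the expansion by hand.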
