Why is numba so fast?

2024/10/6 8:40:07

I want to write a function which will take an index lefts of shape (N_ROWS,) I want to write a function which will create a matrix out = (N_ROWS, N_COLS) matrix such that out[i, j] = 1 if and only if j >= lefts[i]. A simple example of doing this in a loop is here:

class Looped(Strategy):def copy(self, lefts):out = np.zeros([N_ROWS, N_COLS])for k, l in enumerate(lefts): out[k, l:] = 1return out

Now I wanted it to be as fast as possible, so I have different implementations of this function:

  1. The plain python loop
  2. numba with @njit
  3. A pure c++ implementation which I call with ctypes

Here are the results of the average of 100 runs:

Looped took 0.0011599776260009093
Numba took 8.886413300206186e-05
CPP took 0.00013200821400096175

So numba is about 1.5 times than the next fastest implementation which is the c++ one. My question is why?

  • I have heard in similar questions cython was slower because it wasn't being compiled with all the optimisation flags set, but the cpp implementation was compiled with -O3 is that enough for me to have all the possible optimisations the compiler will give me?
  • I do not fully understand how to hand the numpy array to c++, am I inadvertently making a copy of the data here?

# numba implementation@njit
def numba_copy(lefts):out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)for k, l in enumerate(lefts): out[k, l:] = 1.return outclass Numba(Strategy):def __init__(self) -> None:# avoid compilation time when timing numba_copy(np.array([1]))def copy(self, lefts):return numba_copy(lefts)

// array copy cppextern "C" void copy(const long *lefts,  float *outdatav, int n_rows, int n_cols) 
{   for (int i = 0; i < n_rows; i++) {for (int j = lefts[i]; j < n_cols; j++){outdatav[i*n_cols + j] = 1.;}}
}// compiled to a .so using g++ -O3 -shared -o array_copy.so array_copy.cpp
# using cpp implementationclass CPP(Strategy):def __init__(self) -> None:lib = ctypes.cdll.LoadLibrary("./array_copy.so")fun = lib.copyfun.restype = Nonefun.argtypes = [ndpointer(ctypes.c_long, flags="C_CONTIGUOUS"),ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),ctypes.c_long,ctypes.c_long,]self.fun = fundef copy(self, lefts):outdata = np.zeros((N_ROWS, N_COLS), dtype=np.float32, )self.fun(lefts, outdata, N_ROWS, N_COLS)return outdata

The full code with the timing etc:

import time
import ctypes
from itertools import combinationsimport numpy as np
from numpy.ctypeslib import ndpointer
from numba import njitN_ROWS = 1000
N_COLS = 1000class Strategy:def copy(self, lefts):raise NotImplementedErrordef __call__(self, lefts):s = time.perf_counter()n = 1000for _ in range(n):out = self.copy(lefts)print(f"{type(self).__name__} took {(time.perf_counter() - s)/n}")return outclass Looped(Strategy):def copy(self, lefts):out = np.zeros([N_ROWS, N_COLS])for k, l in enumerate(lefts): out[k, l:] = 1return out@njit
def numba_copy(lefts):out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)for k, l in enumerate(lefts): out[k, l:] = 1.return outclass Numba(Strategy):def __init__(self) -> None:numba_copy(np.array([1]))def copy(self, lefts):return numba_copy(lefts)class CPP(Strategy):def __init__(self) -> None:lib = ctypes.cdll.LoadLibrary("./array_copy.so")fun = lib.copyfun.restype = Nonefun.argtypes = [ndpointer(ctypes.c_long, flags="C_CONTIGUOUS"),ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),ctypes.c_long,ctypes.c_long,]self.fun = fundef copy(self, lefts):outdata = np.zeros((N_ROWS, N_COLS), dtype=np.float32, )self.fun(lefts, outdata, N_ROWS, N_COLS)return outdatadef copy_over(lefts):strategies = [Looped(), Numba(), CPP()]outs = []for strategy in strategies:o = strategy(lefts)outs.append(o)for s_0, s_1 in combinations(outs, 2):for a, b in zip(s_0, s_1):np.testing.assert_allclose(a, b)if __name__ == "__main__":copy_over(np.random.randint(0, N_COLS, size=N_ROWS))
Answer

Numba currently uses LLVM-Lite to compile the code efficiently to a binary (after the Python code has been translated to an LLVM intermediate representation). The code is optimized like a C++ code compiled with Clang with the flags -O3 and -march=native. This last parameter is very important as it enables LLVM to use wider SIMD instructions on relatively-recent x86-64 processors: AVX and AVX2 (possibly AVX512 for very recent Intel processors). Otherwise, by default, Clang and GCC use only the SSE/SSE2 instructions (because of backward compatibility).

Another difference come from the comparison between GCC and the LLVM code from Numba. Clang/LLVM tends to aggressively unroll the loops while GCC often don't. This has a significant performance impact on the resulting program. In fact, you can see that the generated assembly code from Clang:

With Clang (128 items per loops):

.LBB0_7:vmovups ymmword ptr [r9 + 4*r8 - 480], ymm0vmovups ymmword ptr [r9 + 4*r8 - 448], ymm0vmovups ymmword ptr [r9 + 4*r8 - 416], ymm0vmovups ymmword ptr [r9 + 4*r8 - 384], ymm0vmovups ymmword ptr [r9 + 4*r8 - 352], ymm0vmovups ymmword ptr [r9 + 4*r8 - 320], ymm0vmovups ymmword ptr [r9 + 4*r8 - 288], ymm0vmovups ymmword ptr [r9 + 4*r8 - 256], ymm0vmovups ymmword ptr [r9 + 4*r8 - 224], ymm0vmovups ymmword ptr [r9 + 4*r8 - 192], ymm0vmovups ymmword ptr [r9 + 4*r8 - 160], ymm0vmovups ymmword ptr [r9 + 4*r8 - 128], ymm0vmovups ymmword ptr [r9 + 4*r8 - 96], ymm0vmovups ymmword ptr [r9 + 4*r8 - 64], ymm0vmovups ymmword ptr [r9 + 4*r8 - 32], ymm0vmovups ymmword ptr [r9 + 4*r8], ymm0sub     r8, -128add     rbp, 4jne     .LBB0_7

With GCC (8 items per loops):

.L5:mov     rdx, raxvmovups YMMWORD PTR [rax], ymm0add     rax, 32cmp     rdx, rcxjne     .L5

Thus, to be fair, you need to compare the Numba code with the C++ code compiled with Clang and the above optimization flags.


Note that regarding your needs and the size of your last-level processor cache, you can write a faster platform-specific C++ code using non-temporal stores (NT-stores). NT-stores tell to the processor not to store the array in its cache. Writing data using NT-stores is faster to write huge arrays in RAM, but this can slower when you read the stored array after the copy if the array could fit in the cache (since the array will have to be reloaded from the RAM). In your case (4 MiB array), this is unclear either this would be faster.

https://en.xdnf.cn/q/70385.html

Related Q&A

How to create a field with a list of foreign keys in SQLAlchemy?

I am trying to store a list of models within the field of another model. Here is a trivial example below, where I have an existing model, Actor, and I want to create a new model, Movie, with the field …

Implementing a recursive algorithm in pyspark to find pairings within a dataframe

I have a spark dataframe (prof_student_df) that lists student/professor pair for a timestamp. There are 4 professors and 4 students for each timestamp and each professor-student pair has a “score” (s…

Python Delegate Pattern - How to avoid circular reference?

I would to ask if using the Delegate Pattern in Python would lead to circular references and if so, what would be the best way to implement it to ensure the object and its delegate will be garbage coll…

Render Jinja after jQuery AJAX request to Flask

I have a web application that gets dynamic data from Flask when a select element from HTML is changed. of course that is done via jquery ajax. No probs here I got that.The problem is, the dynamic data …

shape-preserving piecewise cubic interpolation for 3D curve in python

I have a curve in 3D space. I want to use a shape-preserving piecewise cubic interpolation on it similar to pchip in matlab. I researched functions provided in scipy.interpolate, e.g. interp2d, but …

ForeignKey vs OneToOne field django [duplicate]

This question already has answers here:OneToOneField() vs ForeignKey() in Django(12 answers)Closed 9 years ago.I need to extend django user with some additional fields . I found 2 different ways there…

How to sort glob.glob numerically?

I have a bunch of files sorted numerically on a folder, when I try to sort glob.glob I never get the files in the right order.file examples and expected output sorting folder ------ C:\Users\user\Deskt…

How to determine a numpy-array reshape strategy

For a python project I often find myself reshaping and re-arranging n-dimensional numpy arrays. However, I have a hard time to determine how to approach the problem, visualize the outcome of the result…

matplotlib plotting multiple lines in 3D

I am trying to plot multiple lines in a 3D plot using matplotlib. I have 6 datasets with x and y values. What Ive tried so far was, to give each point in the data sets a z-value. So all points in data …

How to get a telegram private channel id with telethon

Hi cant figure out how to solve this problem, so any help will be really appreciated. Im subscribed to a private channel. This channel has no username and I dont have the invite link (the admin just ad…