Question 1

I want to write a function which will take an index lefts of shape (N_ROWS,) I want to write a function which will create a matrix out = (N_ROWS, N_COLS) matrix such that out[i, j] = 1 if and only if j >= lefts[i]. A simple example of doing this in a loop is here:

class Looped(Strategy):def copy(self, lefts):out = np.zeros([N_ROWS, N_COLS])for k, l in enumerate(lefts): out[k, l:] = 1return out

Now I wanted it to be as fast as possible, so I have different implementations of this function:

The plain python loop
numba with @njit
A pure c++ implementation which I call with ctypes

Here are the results of the average of 100 runs:

Looped took 0.0011599776260009093
Numba took 8.886413300206186e-05
CPP took 0.00013200821400096175

So numba is about 1.5 times than the next fastest implementation which is the c++ one. My question is why?

I have heard in similar questions cython was slower because it wasn't being compiled with all the optimisation flags set, but the cpp implementation was compiled with -O3 is that enough for me to have all the possible optimisations the compiler will give me?
I do not fully understand how to hand the numpy array to c++, am I inadvertently making a copy of the data here?

# numba implementation@njit
def numba_copy(lefts):out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)for k, l in enumerate(lefts): out[k, l:] = 1.return outclass Numba(Strategy):def __init__(self) -> None:# avoid compilation time when timing numba_copy(np.array([1]))def copy(self, lefts):return numba_copy(lefts)


// array copy cppextern "C" void copy(const long *lefts,  float *outdatav, int n_rows, int n_cols) 
{   for (int i = 0; i < n_rows; i++) {for (int j = lefts[i]; j < n_cols; j++){outdatav[i*n_cols + j] = 1.;}}
}// compiled to a .so using g++ -O3 -shared -o array_copy.so array_copy.cpp

# using cpp implementationclass CPP(Strategy):def __init__(self) -> None:lib = ctypes.cdll.LoadLibrary("./array_copy.so")fun = lib.copyfun.restype = Nonefun.argtypes = [ndpointer(ctypes.c_long, flags="C_CONTIGUOUS"),ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),ctypes.c_long,ctypes.c_long,]self.fun = fundef copy(self, lefts):outdata = np.zeros((N_ROWS, N_COLS), dtype=np.float32, )self.fun(lefts, outdata, N_ROWS, N_COLS)return outdata

The full code with the timing etc:

import time
import ctypes
from itertools import combinationsimport numpy as np
from numpy.ctypeslib import ndpointer
from numba import njitN_ROWS = 1000
N_COLS = 1000class Strategy:def copy(self, lefts):raise NotImplementedErrordef __call__(self, lefts):s = time.perf_counter()n = 1000for _ in range(n):out = self.copy(lefts)print(f"{type(self).__name__} took {(time.perf_counter() - s)/n}")return outclass Looped(Strategy):def copy(self, lefts):out = np.zeros([N_ROWS, N_COLS])for k, l in enumerate(lefts): out[k, l:] = 1return out@njit
def numba_copy(lefts):out = np.zeros((N_ROWS, N_COLS), dtype=np.float32)for k, l in enumerate(lefts): out[k, l:] = 1.return outclass Numba(Strategy):def __init__(self) -> None:numba_copy(np.array([1]))def copy(self, lefts):return numba_copy(lefts)class CPP(Strategy):def __init__(self) -> None:lib = ctypes.cdll.LoadLibrary("./array_copy.so")fun = lib.copyfun.restype = Nonefun.argtypes = [ndpointer(ctypes.c_long, flags="C_CONTIGUOUS"),ndpointer(ctypes.c_float, flags="C_CONTIGUOUS"),ctypes.c_long,ctypes.c_long,]self.fun = fundef copy(self, lefts):outdata = np.zeros((N_ROWS, N_COLS), dtype=np.float32, )self.fun(lefts, outdata, N_ROWS, N_COLS)return outdatadef copy_over(lefts):strategies = [Looped(), Numba(), CPP()]outs = []for strategy in strategies:o = strategy(lefts)outs.append(o)for s_0, s_1 in combinations(outs, 2):for a, b in zip(s_0, s_1):np.testing.assert_allclose(a, b)if __name__ == "__main__":copy_over(np.random.randint(0, N_COLS, size=N_ROWS))

Question 2

Numba currently uses LLVM-Lite to compile the code efficiently to a binary (after the Python code has been translated to an LLVM intermediate representation). The code is optimized like a C++ code compiled with Clang with the flags -O3 and -march=native. This last parameter is very important as it enables LLVM to use wider SIMD instructions on relatively-recent x86-64 processors: AVX and AVX2 (possibly AVX512 for very recent Intel processors). Otherwise, by default, Clang and GCC use only the SSE/SSE2 instructions (because of backward compatibility).

Another difference come from the comparison between GCC and the LLVM code from Numba. Clang/LLVM tends to aggressively unroll the loops while GCC often don't. This has a significant performance impact on the resulting program. In fact, you can see that the generated assembly code from Clang:

With Clang (128 items per loops):

.LBB0_7:vmovups ymmword ptr [r9 + 4*r8 - 480], ymm0vmovups ymmword ptr [r9 + 4*r8 - 448], ymm0vmovups ymmword ptr [r9 + 4*r8 - 416], ymm0vmovups ymmword ptr [r9 + 4*r8 - 384], ymm0vmovups ymmword ptr [r9 + 4*r8 - 352], ymm0vmovups ymmword ptr [r9 + 4*r8 - 320], ymm0vmovups ymmword ptr [r9 + 4*r8 - 288], ymm0vmovups ymmword ptr [r9 + 4*r8 - 256], ymm0vmovups ymmword ptr [r9 + 4*r8 - 224], ymm0vmovups ymmword ptr [r9 + 4*r8 - 192], ymm0vmovups ymmword ptr [r9 + 4*r8 - 160], ymm0vmovups ymmword ptr [r9 + 4*r8 - 128], ymm0vmovups ymmword ptr [r9 + 4*r8 - 96], ymm0vmovups ymmword ptr [r9 + 4*r8 - 64], ymm0vmovups ymmword ptr [r9 + 4*r8 - 32], ymm0vmovups ymmword ptr [r9 + 4*r8], ymm0sub     r8, -128add     rbp, 4jne     .LBB0_7

With GCC (8 items per loops):

.L5:mov     rdx, raxvmovups YMMWORD PTR [rax], ymm0add     rax, 32cmp     rdx, rcxjne     .L5

Thus, to be fair, you need to compare the Numba code with the C++ code compiled with Clang and the above optimization flags.

Note that regarding your needs and the size of your last-level processor cache, you can write a faster platform-specific C++ code using non-temporal stores (NT-stores). NT-stores tell to the processor not to store the array in its cache. Writing data using NT-stores is faster to write huge arrays in RAM, but this can slower when you read the stored array after the copy if the array could fit in the cache (since the array will have to be reloaded from the RAM). In your case (4 MiB array), this is unclear either this would be faster.

Why is numba so fast?

Related Q&A

How to create a field with a list of foreign keys in SQLAlchemy?

Implementing a recursive algorithm in pyspark to find pairings within a dataframe

Python Delegate Pattern - How to avoid circular reference?

Render Jinja after jQuery AJAX request to Flask

shape-preserving piecewise cubic interpolation for 3D curve in python

ForeignKey vs OneToOne field django [duplicate]

How to sort glob.glob numerically?

How to determine a numpy-array reshape strategy

matplotlib plotting multiple lines in 3D

How to get a telegram private channel id with telethon