Suppose I have an array
from array import array
myarr = array('l', [1, 2, 3])
and a variable:
myvar = 4
what is the fastest way to create a new array:
newarray = array('l', [1, 2, 3, 4])
You can assume all elements are of 'long' type
I have tried to create a new array and use array.append(), but I am not sure that is the fastest way. I was thinking of using a memoryview over malloc'ed memory, like:
malloc(4*sizeof(long))
but I don't know how to copy the shorter array into part of the memoryview and then insert the last element into the last position.
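Roughly, this is the kind of thing I had in mind (just a sketch, I am not sure it is correct or even compiles; the function name is made up):

from libc.stdlib cimport malloc

def cappend_sketch(long[:] arr, long var):
    cdef Py_ssize_t N = arr.shape[0]
    # allocate raw memory for N+1 longs and wrap it in a memoryview
    cdef long* buf = <long*>malloc((N + 1) * sizeof(long))
    cdef long[:] result = <long[:N + 1]>buf
    result[:N] = arr   # copy the shorter array into the first N slots
    result[N] = var    # put the new element into the last position
    return result      # note: nothing ever frees this memory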
I am fairly new to Cython. Thanks for any help!
Update:
I compared the following three methods:
Cython: [100000 loops, best of 3: 5.94 µs per loop]
from libc.stdlib cimport malloc

def cappend(long[:] arr, long var, size_t N):
    cdef long[:] result = <long[:(N+1)]>malloc((N+1)*sizeof(long))
    result.base[:N] = arr
    result.base[N] = var
    return result
array: [1000000 loops, best of 3: 1.21 µs per loop]
from array import array
import copy
def pyappend(arr, x):
    result = copy.copy(arr)
    result.append(x)
    return result
list append: [1000000 loops, best of 3: 480 ns per loop]
def pylistappend(lst, x):
    result = lst[:]
    result.append(x)
    return result
Is there hope to improve the Cython part and beat the array one?
Cython gives us more access to the internals of array.array than "normal" Python does, so we can use it to speed up the code:
- almost by a factor of 7 for your small example (by eliminating most of the overhead).
- by a factor of 2 for larger inputs, by eliminating an unnecessary array copy.
Read on for more details.
It's a little bit unusual to try to optimize a function for such small input, but not without (at least theoretical) interest.
So let's start with your functions as baseline:
a=array('l', [1,2,3])
%timeit pyappend(a, 8)
1.03 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

lst=[1,2,3]
%timeit pylistappend(lst, 8)
279 ns ± 6.03 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
We must be aware that what we are measuring is not the cost of copying but the cost of the overhead (Python interpreter, function calls, and so on); for example, there is no difference whether a has 3 or 5 elements:
a=array('l', range(5))
%timeit pyappend(a, 8)
1.03 µs ± 6.76 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In the array version we have more overhead because of the indirection via the copy module; we can try to eliminate that:
def pyappend2(arr, x):
    result = array('l', arr)
    result.append(x)
    return result

%timeit pyappend2(a, 8)
496 ns ± 5.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
That is faster. Now let's use Cython - this eliminates the interpreter costs:
%%cython
def cylistappend(lst, x):
    result = lst[:]
    result.append(x)
    return result

%%cython
from cpython cimport array
def cyappend(array.array arr, long long int x):
    cdef array.array res = array.array('l', arr)
    res.append(x)
    return res

%timeit cylistappend(lst, 8)
193 ns ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit cyappend(a, 8)
421 ns ± 8.08 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The Cython versions are about 33% faster for list and about 10% faster for array. The constructor array.array() expects an iterable, but we already have an array.array, so we use the functionality from cpython to get access to the internals of the array.array object and improve the situation a little:
%%cython
from cpython cimport array
def cyappend2(array.array arr, long long int x):
    cdef array.array res = array.copy(arr)
    res.append(x)
    return res

%timeit cyappend2(a, 8)
305 ns ± 7.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
For the next step we need to know how array.array appends elements: normally, it over-allocates, so append() has amortized cost O(1); however, after array.copy the new array holds exactly the needed number of elements, so the next append triggers a reallocation. We need to change that (see the Cython documentation of cpython.array for a description of the used functions):
%%cython
from cpython cimport array
from libc.string cimport memcpy

def cyappend3(array.array arr, long long int x):
    cdef Py_ssize_t n = len(arr)
    cdef array.array res = array.clone(arr, n+1, False)
    memcpy(res.data.as_voidptr, arr.data.as_voidptr, 8*n)  # that is pretty sloppy..
    res.data.as_longlongs[n] = x
    return res

%timeit cyappend3(a, 8)
154 ns ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Similar to your function, the memory is allocated with its final size right away, so we no longer need to call resize(). Now we are faster than the list version and almost 7 times faster than the original Python version.
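Aside: if you are not working in IPython/Jupyter (where the %%cython magic compiles the snippets for you), the same function can be put into an ordinary .pyx file and built with a small setup script, for example (the file name cyappend3.pyx is just a placeholder):

# setup.py - minimal build script, assuming the code above lives in cyappend3.pyx
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("cyappend3.pyx"))

and compiled via python setup.py build_ext --inplace.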
Let's compare timings for bigger array sizes (a=array('l', range(1000)), lst=list(range(1000))), where copying the data accounts for most of the running time:
pyappend        1.84 µs  # copy-module is slow!
pyappend2       1.02 µs
cyappend        0.94 µs  # cython no big help - we are copying twice
cyappend2       0.90 µs  # still copying twice
cyappend3       0.43 µs  # copying only once -> twice as fast!
pylistappend    4.09 µs  # needs to increment refs of integers
cylistappend    3.85 µs  # the same as above
Now, eliminating the unnecessary copy for array.array gives us the expected factor of 2.
For even bigger arrays (10000
elements), we see the following:
pyappend     6.9 µs  # copy-module is slow!
pyappend2    4.8 µs
cyappend2    4.4 µs
cyappend3    4.4 µs
There is no longer a difference between the versions (if one discards the slow copy module). The reason for this is the changed behavior of array.array for such a large number of elements: when copying, it over-allocates, thus avoiding the reallocation after the first append().
We can easily check it:
b = array('l', array('l', range(10**3)))  # emulate our functions
b.buffer_info()
(94481422849232, 1000)
b.append(1)
b.buffer_info()
(94481422860352, 1001)   # another pointer address -> reallocated
...
b = array('l', array('l', range(10**4)))
b.buffer_info()
(94481426290064, 10000)
b.append(33)
b.buffer_info()
(94481426290064, 10001)  # the same pointer address -> no reallocation!