Some days I just hate using middleware. Take this for example: I'd like to have a lookup table that maps values from a set of input (domain) values to output (range) values. The mapping is unique. A Python dict can do this, but since the table is quite big I figured, why not use a pd.Series and its index, which has the added benefit that I can:
- pass in multiple values to be mapped as a series (hopefully faster than dictionary lookup)
- the original series' index is maintained in the result
like so:
import numpy as np
import pandas as pd

# Build the lookup table: index = domain values, values = range values
domain2range = pd.Series(allrangevals, index=alldomainvals)

# Apply the map
query_vals = pd.Series(domainvals, index=someindex)
result = query_vals.map(domain2range)

assert result.index is someindex                    # Nice
assert np.isin(result.values, allrangevals).all()   # Nice
Works as expected. Except it doesn't: the above .map's time cost grows with len(domain2range), not (more sensibly) with len(query_vals),
as can be shown:
import timeit

numiter = 100
for n in [10, 1000, 1000000, 10000000]:
    domain = np.arange(0, n)
    range = domain + 10
    maptable = pd.Series(range, index=domain).sort_index()
    query_vals = pd.Series([1, 2, 3])

    def f():
        query_vals.map(maptable)

    print(n, timeit.timeit(stmt=f, number=numiter) / numiter)

which prints:

10 0.000630810260773
1000 0.000978469848633
1000000 0.00130645036697
10000000 0.0162791204453
Facepalm. At n=10000000 that's roughly 0.016/3 ≈ 0.005 seconds per mapped value.
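For comparison, here is a minimal sketch of the same benchmark done against a plain dict (my assumption being that its lookup cost should not grow with the size of the table); the loop mirrors the one above:

import timeit
import numpy as np

numiter = 100
for n in [10, 1000, 1000000, 10000000]:
    # Same domain -> range mapping, stored in a plain dict this time
    lookup = dict(zip(np.arange(0, n), np.arange(0, n) + 10))

    def g():
        [lookup[v] for v in [1, 2, 3]]

    print(n, timeit.timeit(stmt=g, number=numiter) / numiter)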
So, questions:
- is Series.map expected to behave like this? Why is it so utterly, ridiculously slow? I think I'm using it as shown in the docs.
- is there a fast way to use pandas to do a table lookup? It seems like the above is not it (one possible workaround is sketched after this list).
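To frame the second question a bit, here is a sketch of the one workaround I've been toying with: bypass .map entirely and binary-search the sorted index myself. The helper name lookup_sorted is mine, and the sketch assumes the table index is sorted and unique and that every query value actually exists in the domain:

import numpy as np
import pandas as pd

def lookup_sorted(maptable, query_vals):
    # Binary search on the (sorted, unique) index:
    # O(len(query_vals) * log(len(maptable))) rather than scaling with the table size.
    # NOTE: no membership check -- a query value missing from the domain
    # silently yields whatever sits at its insertion position.
    pos = np.searchsorted(maptable.index.values, query_vals.values)
    return pd.Series(maptable.values[pos], index=query_vals.index)

# e.g. reusing maptable and query_vals from the benchmark above:
# result = lookup_sorted(maptable, query_vals)

But this feels like working around the library rather than using it, which is why I'm asking.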