Why is pandas.Series.map so shockingly slow?

2024/9/25 16:35:07

Some days I just hate using middleware. Take this for example: I'd like to have a lookup table that maps a set of input (domain) values to output (range) values. The mapping is unique. A Python dict can do this, but since the map is quite big I figured, why not use a pd.Series and its index, which has the added benefits that:

  • I can pass in multiple values to be mapped as a Series (hopefully faster than a dictionary lookup)
  • the original Series' index is maintained in the result

like so:

import numpy as np
import pandas as pd

domain2range = pd.Series(allrangevals, index=alldomainvals)
# Apply the map
query_vals = pd.Series(domainvals, index=someindex)
result = query_vals.map(domain2range)
assert result.index.equals(query_vals.index)  # Nice: the index is preserved
assert np.isin(result.values, allrangevals).all()  # Nice: every result is a range value

It works as expected, but not at the speed I expected. The time cost of the .map call above grows with len(domain2range) rather than (more sensibly) being O(len(query_vals)), as this benchmark shows:

import timeit

import numpy as np
import pandas as pd

numiter = 100
for n in [10, 1000, 1000000, 10000000]:
    domain = np.arange(0, n)
    range_vals = domain + 10  # renamed from "range" to avoid shadowing the builtin
    maptable = pd.Series(range_vals, index=domain).sort_index()
    query_vals = pd.Series([1, 2, 3])

    def f():
        query_vals.map(maptable)

    # Average seconds per call of f()
    print(n, timeit.timeit(stmt=f, number=numiter) / numiter)

Output:

10 0.000630810260773
1000 0.000978469848633
1000000 0.00130645036697
10000000 0.0162791204453

Facepalm. At n=10000000 it takes about 0.016/3 ≈ 0.005 seconds per mapped value.

So, questions:

  • Is Series.map expected to behave like this? Why is it so utterly, ridiculously slow? I think I'm using it as shown in the docs.
  • Is there a fast way to use pandas to do table lookup? It seems like the above is not it.

Answer

https://github.com/pandas-dev/pandas/issues/21278

Warm-up was the issue (double facepalm). Pandas silently builds and caches a hash index at first use, which costs O(maplen), where maplen is the length of the map table. Calling the tested function once before timing it, so that the index is already built, gives much better performance.
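The warm-up effect described above can be seen in isolation by timing the first and the second call separately. The following is a minimal sketch, assuming a recent pandas on Python 3. The table size n and the helper timed_map are illustrative choices, not part of the original benchmark, and no particular timings are claimed.

```python
import time

import numpy as np
import pandas as pd

n = 1_000_000  # table size, chosen only for illustration
maptable = pd.Series(np.arange(n) + 10, index=np.arange(n))
query_vals = pd.Series([1, 2, 3])


def timed_map():
    """Hypothetical helper: run one map call and return (result, seconds)."""
    start = time.perf_counter()
    result = query_vals.map(maptable)
    return result, time.perf_counter() - start


# First call: includes any one-off setup pandas does for maptable's index.
first_result, first_time = timed_map()
# Second call: the same lookup against the same maptable object.
second_result, second_time = timed_map()

print("first call :", first_time)
print("second call:", second_time)
```

Both calls return the same mapped values; only the elapsed time is expected to differ. The benchmark below is the answer's own version, with each function called once before it is timed.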

import timeit

import numpy as np
import pandas as pd

numiter = 100
for n in [10, 100000, 1000000, 10000000]:
    domain = np.arange(0, n)
    range_vals = domain + 10  # renamed from "range" to avoid shadowing the builtin
    maptable = pd.Series(range_vals, index=domain)  # .sort_index()
    query_vals = pd.Series([1, 2, 3])

    def f1():
        query_vals.map(maptable)

    f1()  # warm-up call, so the one-off index build is not timed
    print("Pandas1 ", n, timeit.timeit(stmt=f1, number=numiter) / numiter)

    def f2():
        query_vals.map(maptable.get)

    f2()  # warm-up call
    print("Pandas2 ", n, timeit.timeit(stmt=f2, number=numiter) / numiter)

    maptabledict = maptable.to_dict()
    query_vals_list = pd.Series([1, 2, 3]).tolist()

    def f3():
        {k: maptabledict[k] for k in query_vals_list}

    f3()  # warm-up call
    print("Py dict ", n, timeit.timeit(stmt=f3, number=numiter) / numiter)
    print()

pd.show_versions()

Output:

Pandas1  10 0.000621199607849
Pandas2  10 0.000686831474304
Py dict  10 2.0170211792e-05

Pandas1  100000 0.00149286031723
Pandas2  100000 0.00118808984756
Py dict  100000 8.47816467285e-06

Pandas1  1000000 0.000708899497986
Pandas2  1000000 0.000479419231415
Py dict  1000000 1.64794921875e-05

Pandas1  10000000 0.000798969268799
Pandas2  10000000 0.000410139560699
Py dict  10000000 1.47914886475e-05

... although it is a little depressing that Python dictionaries are still more than 10x faster (roughly 30x or more in these timings).
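If the speed of the plain dict is wanted but the result should still carry the query's index, one option is to wrap the dict lookup and rebuild a Series from it. This is a minimal sketch, assuming every query value is present in the table. lookup_with_dict is a hypothetical helper name, not a pandas function, and the tiny table is made-up data for illustration.

```python
import pandas as pd


def lookup_with_dict(query_vals, table):
    """Hypothetical helper: map query_vals through a plain dict, keeping its index.

    Raises KeyError if a query value is missing from `table`.
    """
    mapped = [table[k] for k in query_vals.tolist()]
    return pd.Series(mapped, index=query_vals.index)


# Made-up lookup table: domain values 0..3 map to range values 10..13.
maptable = pd.Series([10, 11, 12, 13], index=[0, 1, 2, 3])
maptabledict = maptable.to_dict()  # one-off conversion, O(len(maptable))

query_vals = pd.Series([1, 2, 3], index=["a", "b", "c"])
result = lookup_with_dict(query_vals, maptabledict)
print(result)
```

Unlike Series.map with a Series argument, which yields NaN for keys it cannot find, this version raises KeyError. The to_dict() conversion is itself O(len(maptable)), so it only pays off when the same table is reused for many lookups.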
