How to improve the performance of a lambda function on a massive DataFrame

2024/11/13 18:45:24

I have a DataFrame with hundreds of millions of rows.

     latitude                    longitude                   time                    VAL     
0   -39.20000076293945312500    140.80000305175781250000    1972-01-19 13:00:00     1.20000004768371582031  
1   -39.20000076293945312500    140.80000305175781250000    1972-01-20 13:00:00     0.89999997615814208984 
2   -39.20000076293945312500    140.80000305175781250000    1972-01-21 13:00:00     1.50000000000000000000 
3   -39.20000076293945312500    140.80000305175781250000    1972-01-22 13:00:00     1.60000002384185791016 
4   -39.20000076293945312500    140.80000305175781250000    1972-01-23 13:00:00     1.20000004768371582031
... ...

It contains a time column of type datetime64 holding UTC timestamps. The following code creates a new column, isInDST, that indicates whether each timestamp falls within the daylight saving period of a local time zone.

df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)

It takes about 400 seconds to process 15,223,160 rows.

Is there a faster way to do this? Would vectorizing the function help?

Answer

All timings below were measured on 1M rows.

Cython + np.vectorize

7.2 times faster than the original code

%%cython
from cpython.datetime cimport datetime

cpdef bint c_is_in_dst(datetime dt):
    return dt.dst().total_seconds() != 0

%%timeit
df['isInDST'] = np.vectorize(c_is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())

1.08 s ± 10.2 ms per loop

np.vectorize

6.5 times faster than the original code

def is_in_dst(dt):
    return dt.dst().total_seconds() != 0

%%timeit
df['isInDST'] = np.vectorize(is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())

1.2 s ± 29.3 ms per loop

The NumPy documentation says of np.vectorize that "the implementation is essentially a for loop", so I expected it to perform the same as the list comprehension. In practice it is consistently slightly faster.
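To make the comparison concrete, here is a minimal sketch showing that np.vectorize and a list comprehension call the same Python function once per element and return the same flags. The two sample timestamps are made up for illustration, and passing otypes=[bool] is my own addition (it tells NumPy the output type up front); neither is part of the benchmark above.

```python
import numpy as np
import pandas as pd

def is_in_dst(dt):
    # True when the time zone reports a non-zero DST offset for dt
    return dt.dst().total_seconds() != 0

# Two illustrative UTC timestamps: southern summer and southern winter
times = pd.Series(pd.to_datetime(["2024-01-15 03:00:00", "2024-07-15 03:00:00"]))
local = pd.DatetimeIndex(times).tz_localize("UTC").tz_convert("Australia/Victoria")

# Object array of timezone-aware datetime.datetime instances
py_datetimes = local.to_pydatetime()

# Both variants loop in Python and call is_in_dst once per element
via_vectorize = np.vectorize(is_in_dst, otypes=[bool])(py_datetimes)
via_listcomp = np.array([is_in_dst(dt) for dt in py_datetimes])

print(via_vectorize.tolist())  # [True, False]
print(via_listcomp.tolist())   # [True, False]
```

Since the work per element is identical, any speed difference between the two comes from how the loop and the result array are built, not from the DST check itself.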

List comprehension

5.9 times faster than the original code

%%timeit
df['isInDST'] = [x.dst().total_seconds()!=0 for x in pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')]

1.33 s ± 48.4 ms per loop

This result shows that pandas map/apply is very slow: it adds overhead that a plain Python loop avoids.
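Going one step further, the per-element Python call can in principle be avoided altogether. The following is only a sketch and was not part of the timings in this answer. It assumes the zone's standard (non-DST) UTC offset is a fixed, known value, which I believe holds for Australia/Victoria (+10:00) over the period in question, and it treats any other offset as DST. The function name is_in_dst_vectorized and its standard_offset parameter are hypothetical.

```python
import pandas as pd

def is_in_dst_vectorized(utc_naive, tz, standard_offset):
    """Flag rows whose local UTC offset differs from the zone's standard offset.

    utc_naive: Series of naive datetime64 values that represent UTC.
    standard_offset: assumed constant non-DST offset of tz (an assumption
    supplied by the caller, not looked up from the tz database).
    """
    local = utc_naive.dt.tz_localize("UTC").dt.tz_convert(tz)
    # Dropping the tz keeps the local wall time, so the difference
    # to the original UTC values is the UTC offset of each row.
    utc_offset = local.dt.tz_localize(None) - utc_naive
    return utc_offset != standard_offset

# Illustrative timestamps: southern summer (DST) and southern winter
times = pd.Series(pd.to_datetime(["2024-01-15 03:00:00", "2024-07-15 03:00:00"]))
flags = is_in_dst_vectorized(times, "Australia/Victoria", pd.Timedelta(hours=10))
print(flags.tolist())  # [True, False]
```

This keeps all work inside pandas operations on whole columns. It would give wrong answers for a zone whose standard offset changed during the covered period, so it is worth checking against the dst()-based result on a sample before relying on it.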

Original approach (map on pandas DatetimeIndex)

%%timeit
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)

7.82 s ± 84.3 ms per loop

Tested on 1M rows of dummy data

import datetime
import random
N = 1_000_000
df = pd.DataFrame({"time": [datetime.datetime.now().replace(hour=random.randint(0,23),minute=random.randint(0,59)) for _ in range(N)]})

I also ran the code on 100K and 10M rows; the run time scales linearly with the number of rows.
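For anyone who wants to check the scaling outside a notebook, here is a small timing sketch for the list-comprehension variant. The helper names make_dummy and flag_dst, the row counts, and the choice of random timestamps spread over 2024 are all assumptions made for illustration; the absolute numbers will differ by machine.

```python
import time

import numpy as np
import pandas as pd

def make_dummy(n, seed=0):
    # n random naive timestamps (treated as UTC) spread across 2024
    rng = np.random.default_rng(seed)
    seconds = rng.integers(0, 366 * 24 * 3600, size=n)
    stamps = pd.Timestamp("2024-01-01") + pd.to_timedelta(seconds, unit="s")
    return pd.DataFrame({"time": stamps})

def flag_dst(df):
    # List-comprehension variant from this answer
    local = pd.DatetimeIndex(df["time"]).tz_localize("UTC").tz_convert("Australia/Victoria")
    return [x.dst().total_seconds() != 0 for x in local]

timings = {}
for n in (1_000, 10_000, 100_000):
    df = make_dummy(n)
    start = time.perf_counter()
    flags = flag_dst(df)
    timings[n] = time.perf_counter() - start
    print(f"{n:>7} rows: {timings[n]:.4f} s")
```

If the cost per row is constant, each tenfold increase in rows should take roughly ten times as long, apart from fixed overhead that is visible at the smallest size.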

