I have a DataFrame df with hundreds of millions of rows:
latitude longitude time VAL
0 -39.20000076293945312500 140.80000305175781250000 1972-01-19 13:00:00 1.20000004768371582031
1 -39.20000076293945312500 140.80000305175781250000 1972-01-20 13:00:00 0.89999997615814208984
2 -39.20000076293945312500 140.80000305175781250000 1972-01-21 13:00:00 1.50000000000000000000
3 -39.20000076293945312500 140.80000305175781250000 1972-01-22 13:00:00 1.60000002384185791016
4 -39.20000076293945312500 140.80000305175781250000 1972-01-23 13:00:00 1.20000004768371582031
... ...
It contains a time column of dtype datetime64 in UTC. The following code creates a new column isInDST to indicate whether the time falls within the daylight saving period of a local time zone:
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)
It takes about 400 seconds to process 15,223,160 rows.
Is there a better-performing approach to achieve this? Is vectorizing a better way?
All results are calculated on 1M datapoints.
Cython + np.vectorize
7.2 times faster than the original code
%%cython
from cpython.datetime cimport datetime
cpdef bint c_is_in_dst(datetime dt):
    return dt.dst().total_seconds() != 0

%%timeit
df['isInDST'] = np.vectorize(c_is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())
1.08 s ± 10.2 ms per loop
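If you reproduce this in a Jupyter/IPython notebook (an assumption about the setup here), the %%cython cell magic has to be enabled once before the cell above will run:

%load_ext Cython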
np.vectorize
6.5 times faster than the original code
def is_in_dst(dt):
    return dt.dst().total_seconds() != 0

%%timeit
df['isInDST'] = np.vectorize(is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())
1.2 s ± 29.3 ms per loop
Based on the documentation ("The implementation is essentially a for loop") I expected the result to be about the same as the list comprehension, but it is consistently a little better than the list comprehension.
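As a rough mental model (a simplified sketch, not NumPy's actual implementation, which also handles broadcasting and output dtype inference), np.vectorize(func)(values) behaves roughly like this:

import numpy as np

def vectorize_like(func, values):
    # Call the Python function once per element and collect the results into an array;
    # there is no compiled fast path, so the per-element cost stays at the Python level.
    return np.array([func(v) for v in values])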
List comprehension
5.9 times faster than the original code
%%timeit
df['isInDST'] = [x.dst().total_seconds()!=0 for x in pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')]
1.33 s ± 48.4 ms per loop
This result shows that pandas map/apply is very slow; it adds extra overhead that can be eliminated by just using a plain Python loop.
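For illustration, the list comprehension above does essentially the same work as this explicit loop (a sketch assuming the df and time column from the question):

import pandas as pd

local_times = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')
flags = []
for ts in local_times:
    # Each element is a tz-aware Timestamp; dst() returns a non-zero timedelta
    # while daylight saving time is in effect.
    flags.append(ts.dst().total_seconds() != 0)
df['isInDST'] = flags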
Original approach (map on pandas DatetimeIndex)
%%timeit
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)
7.82 s ± 84.3 ms per loop
Tested on 1M rows of dummy data
import datetime, random
import pandas as pd

N = 1_000_000
df = pd.DataFrame({"time": [datetime.datetime.now().replace(hour=random.randint(0, 23), minute=random.randint(0, 59)) for _ in range(N)]})
I also ran the code on 100K and 10M rows; the results scale linearly with the number of rows.