Performance issue turning rows with start/end dates into a dataframe with a time index


I have a large dataset where each line represents the value of a certain type (think a sensor) for a time interval (between start and end). It looks like this:

    start       end    type value
2015-01-01  2015-01-05  1   3
2015-01-06  2015-01-08  1   2
2015-01-05  2015-01-08  3   3
2015-01-13  2015-01-16  2   1

I want to turn it into a daily time-indexed frame like this:

day       type  value
2015-01-01  1   3
2015-01-02  1   3
2015-01-03  1   3
2015-01-04  1   3
2015-01-05  1   3
2015-01-06  1   2
2015-01-07  1   2
2015-01-08  1   2
2015-01-05  3   3
2015-01-06  3   3
2015-01-07  3   3
2015-01-08  3   3
2015-01-13  2   1
2015-01-14  2   1
2015-01-15  2   1
2015-01-16  2   1

(Note that we cannot make any assumptions about the intervals: they should be contiguous and non-overlapping, but we cannot guarantee that.)

Based on these Stack Overflow answers, [1] (DataFrame resample on date ranges) and [2] (pandas: Aggregate based on start/end date), there seem to be two methods: one built around itertuples and one built around melt ([2] used stack/unstack, but that is similar to melt). Let us compare them for performance.

import datetime as dt
import numpy as np
import pandas as pd

# Creating a big enough dataframe
date_range = pd.date_range(start=dt.datetime(2015, 1, 1), end=dt.datetime(2019, 12, 31), freq='4D')
to_concat = []
for val in range(1, 50):
    frame_tmp = pd.DataFrame()
    frame_tmp['start'] = date_range
    frame_tmp['end'] = frame_tmp['start'] + dt.timedelta(3)
    frame_tmp['type'] = val
    frame_tmp['value'] = np.random.randint(1, 6, frame_tmp.shape[0])
    to_concat.append(frame_tmp)
df = pd.concat(to_concat, ignore_index=True)

# Method 1
def method_1(df):
    df1 = (pd.concat([pd.Series(r.Index, pd.date_range(r.start, r.end, freq='D'))
                      for r in df.itertuples()])
           .reset_index())
    df1.columns = ['start_2', 'idx']
    df2 = df1.set_index('idx').join(df).reset_index(drop=True)
    return df2.set_index('start_2')

df_method_1 = df.groupby(['type']).apply(method_1)

# Method 2
df_tmp = df.reset_index()
df1 = df_tmp.melt(df_tmp.columns.difference(['start', 'end']),
                  ['start', 'end'],
                  value_name='current_time')
df_method_2 = (df1.set_index('current_time')
                  .groupby('index', group_keys=False)
                  .resample('D').ffill())

With %%timeit in Jupyter, method 1 takes ~8 s and method 2 takes ~25 s on the example dataframe defined above. This is far too slow, as the real dataset I am dealing with is much bigger. On that dataframe, method 1 takes ~20 minutes.
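
For reference, a minimal timing harness outside of Jupyter might look like the sketch below (my own addition, not part of the original question); it assumes the df and method_1 defined above and uses the standard-library timeit module in place of the %%timeit cell magic:

import timeit

# One run already takes several seconds, so a single repetition is enough here.
t1 = timeit.timeit(lambda: df.groupby(['type']).apply(method_1), number=1)
print(f"method 1: {t1:.1f} s")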

Do you have any idea on how to make this faster?

Answer

This is about 1.7 times faster than your method_1 and a little tidier:

df_expand = pd.DataFrame.from_records(
    ((d, r.type, r.value)
     for r in df.itertuples()
     for d in pd.date_range(start=r.start, end=r.end, freq='D')),
    columns=['day', 'type', 'value']
)

You can get about 7 times faster by creating your own date range instead of calling pd.date_range():

one_day = dt.timedelta(1)
df_expand = pd.DataFrame.from_records(
    ((r.start + i * one_day, r.type, r.value)
     for r in df.itertuples()
     for i in range(int((r.end - r.start) / one_day) + 1)),
    columns=['day', 'type', 'value']
)

Or you can get up to 24 times faster by using numpy's arange function to generate the dates:

one_day = dt.timedelta(1)
df_expand = pd.DataFrame.from_records(
    ((d, r.type, r.value)
     for r in df.itertuples()
     for d in np.arange(r.start.date(), r.end.date() + one_day, dtype='datetime64[D]')),
    columns=['day', 'type', 'value']
)

I couldn't resist adding one more that is a little more than twice as fast as the last one. Unfortunately, it's a lot harder to read. This groups the readings based on how many days they span ('dur'), then uses vectorized numpy operations to expand each group in a single batch.

def expand_group(g):
    dur = g.dur.iloc[0]  # how many days for each reading in this group?
    return pd.DataFrame({
        'day': (g.start.values[:, None] + np.timedelta64(1, 'D') * np.arange(dur)).ravel(),
        'type': np.repeat(g.type.values, dur),
        'value': np.repeat(g.value.values, dur),
    })

# take all readings with the same duration and process them together using vectorized code
df_expand = (df.assign(dur=(df['end'] - df['start']).dt.days + 1)
               .groupby('dur')
               .apply(expand_group)
               .reset_index('dur', drop=True))

Update: Responding to your comment, below is a simplified version of the vectorized approach, which is faster and easier to read. Instead of using the groupby step, this makes a single matrix as wide as the longest reading, then filters out the unneeded entries. This should be pretty efficient unless the maximum duration for your readings is much longer than the average. With the test dataframe (all readings lasting 4 days), this is about 15x faster than the groupby solution and about 700x faster than method_1.

dur = (df['end'] - df['start']).max().days + 1

# Build one matrix as wide as the longest reading, then drop the days past each row's end.
df_expand = pd.DataFrame({
    'day': (df['start'].values[:, None] + np.timedelta64(1, 'D') * np.arange(dur)).ravel(),
    'type': np.repeat(df['type'].values, dur),
    'value': np.repeat(df['value'].values, dur),
    'end': np.repeat(df['end'].values, dur),
})
df_expand = df_expand.loc[df_expand['day'] <= df_expand['end'], 'day':'value']
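
As a quick sanity check (my own addition, not part of the original answer), each reading should expand to exactly one row per day it covers, and setting 'day' as the index reproduces the time-indexed layout asked for in the question:

# A reading spanning n days should contribute exactly n rows to df_expand.
expected_rows = ((df['end'] - df['start']).dt.days + 1).sum()
assert len(df_expand) == expected_rows

# Day-indexed view matching the desired output layout.
df_daily = df_expand.set_index('day').sort_index()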
