I have a large dataset where each line represents the value of a certain type (think a sensor) for a time interval (between start and end).
It looks like this:
start end type value
2015-01-01 2015-01-05 1 3
2015-01-06 2015-01-08 1 2
2015-01-05 2015-01-08 3 3
2015-01-13 2015-01-16 2 1
I want to turn it into a daily time-indexed frame like this:
day type value
2015-01-01 1 3
2015-01-02 1 3
2015-01-03 1 3
2015-01-04 1 3
2015-01-05 1 3
2015-01-06 1 2
2015-01-07 1 2
2015-01-08 1 2
2015-01-05 3 3
2015-01-06 3 3
2015-01-07 3 3
2015-01-08 3 3
2015-01-13 2 1
2015-01-14 2 1
2015-01-15 2 1
2015-01-16 2 1
(Note that we cannot make any assumptions about the intervals: they should be contiguous and non-overlapping, but we cannot guarantee that.)
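For what it's worth, here is a quick way to check whether that assumption holds (just a sketch, assuming the data above is already loaded into a DataFrame df with datetime start/end columns): sort by start within each type and flag rows that begin before the previous row ended.

# Flag intervals that start before the previous interval of the same type has ended.
df_sorted = df.sort_values(['type', 'start'])
overlapping = df_sorted['start'] <= df_sorted.groupby('type')['end'].shift()
print(df_sorted[overlapping])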
Based on these Stack Overflow answers [1] (DataFrame resample on date ranges) and [2] (pandas: Aggregate based on start/end date), there seem to be two approaches: one built around itertuples and one built around melt ([2] uses stack/unstack, but it is similar to melt).
Let us compare them for performance.
import datetime as dt
import numpy as np
import pandas as pd

# Creating a big enough dataframe
date_range = pd.date_range(start=dt.datetime(2015, 1, 1), end=dt.datetime(2019, 12, 31), freq='4D')
to_concat = []
for val in range(1, 50):
    frame_tmp = pd.DataFrame()
    frame_tmp['start'] = date_range
    frame_tmp['end'] = frame_tmp['start'] + dt.timedelta(3)
    frame_tmp['type'] = val
    frame_tmp['value'] = np.random.randint(1, 6, frame_tmp.shape[0])
    to_concat.append(frame_tmp)
df = pd.concat(to_concat, ignore_index=True)

# Method 1
def method_1(df):
    df1 = (pd.concat([pd.Series(r.Index, pd.date_range(r.start, r.end, freq='D'))
                      for r in df.itertuples()])
           .reset_index())
    df1.columns = ['start_2', 'idx']
    df2 = df1.set_index('idx').join(df).reset_index(drop=True)
    return df2.set_index('start_2')

df_method_1 = df.groupby(['type']).apply(method_1)

# Method 2
df_tmp = df.reset_index()
df1 = df_tmp.melt(df_tmp.columns.difference(['start', 'end']),
                  ['start', 'end'],
                  value_name='current_time')
df_method_2 = (df1.set_index('current_time')
                  .groupby('index', group_keys=False)
                  .resample('D').ffill())
With %%timeit in Jupyter, method 1 takes ~8 s and method 2 takes ~25 s on the example dataframe defined above. This is far too slow, because the real dataset I am dealing with is much bigger; on that dataset, method 1 takes ~20 minutes.
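For reference, the same comparison can also be run outside Jupyter with the standard-library timeit module; this is just a sketch, and the wrapper functions below are placeholder names for the two snippets above.

import timeit

def run_method_1():
    return df.groupby(['type']).apply(method_1)

def run_method_2():
    tmp = df.reset_index()
    melted = tmp.melt(tmp.columns.difference(['start', 'end']),
                      ['start', 'end'], value_name='current_time')
    return (melted.set_index('current_time')
                  .groupby('index', group_keys=False)
                  .resample('D').ffill())

# Average over a few runs; number is kept small because each run takes seconds.
print(timeit.timeit(run_method_1, number=3) / 3)
print(timeit.timeit(run_method_2, number=3) / 3)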
Do you have any idea on how to make this faster?
This is about 1.7 times faster than your method_1 and a little tidier:
df_expand = pd.DataFrame.from_records(
    ((d, r.type, r.value)
     for r in df.itertuples()
     for d in pd.date_range(start=r.start, end=r.end, freq='D')),
    columns=['day', 'type', 'value']
)
You can get about 7 times faster by creating your own date range instead of calling pd.date_range():
one_day = dt.timedelta(1)
df_expand = pd.DataFrame.from_records(
    ((r.start + i * one_day, r.type, r.value)
     for r in df.itertuples()
     for i in range(int((r.end - r.start) / one_day) + 1)),
    columns=['day', 'type', 'value']
)
Or you can get up to 24 times faster by using numpy's arange function to generate the dates:
one_day = dt.timedelta(1)
df_expand = pd.DataFrame.from_records(
    ((d, r.type, r.value)
     for r in df.itertuples()
     for d in np.arange(r.start.date(), r.end.date() + one_day, dtype='datetime64[D]')),
    columns=['day', 'type', 'value']
)
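As a sanity check (my addition, not part of the answer itself), the pd.date_range and np.arange variants should produce the same rows once the day column is normalised to timestamps and the frames are sorted; df_expand_range and df_expand_arange below are placeholder names for the two results.

# Sanity check: both expansions should contain exactly the same rows.
a = (df_expand_range.assign(day=pd.to_datetime(df_expand_range['day']))
                    .sort_values(['type', 'day']).reset_index(drop=True))
b = (df_expand_arange.assign(day=pd.to_datetime(df_expand_arange['day']))
                     .sort_values(['type', 'day']).reset_index(drop=True))
pd.testing.assert_frame_equal(a, b, check_dtype=False)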
I couldn't resist adding one more that is a little more than twice as fast as the last one. Unfortunately, it's a lot harder to read. This groups the readings based on how many days they span ('dur'), then uses vectorized numpy operations to expand each group in a single batch.
def expand_group(g):
    dur = g.dur.iloc[0]  # how many days for each reading in this group?
    return pd.DataFrame({
        'day': (g.start.values[:, None] + np.timedelta64(1, 'D') * np.arange(dur)).ravel(),
        'type': np.repeat(g.type.values, dur),
        'value': np.repeat(g.value.values, dur),
    })
# take all readings with the same duration and process them together using vectorized code
df_expand = (df
    .assign(dur=(df['end'] - df['start']).dt.days + 1)
    .groupby('dur')
    .apply(expand_group)
    .reset_index('dur', drop=True)
)
Update: Responding to your comment, below is a simplified version of the vectorized approach, which is faster and easier to read. Instead of using the groupby step, this makes a single matrix as wide as the longest reading, then filters out the unneeded entries. This should be pretty efficient unless the maximum duration of your readings is much longer than the average. With the test dataframe (all readings lasting 4 days), this is about 15x faster than the groupby solution and about 700x faster than method_1.
dur = (df['end'] - df['start']).max().days + 1

df_expand = pd.DataFrame({
    'day': (df['start'].values[:, None] + np.timedelta64(1, 'D') * np.arange(dur)).ravel(),
    'type': np.repeat(df['type'].values, dur),
    'value': np.repeat(df['value'].values, dur),
    'end': np.repeat(df['end'].values, dur),
})
df_expand = df_expand.loc[df_expand['day'] <= df_expand['end'], 'day':'value']
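If you want the day-indexed layout shown in your target frame, one cheap final step (just a suggestion, not part of the timing comparison) is:

# Optional: one row per day, with the day as the index, as in the question's target frame.
df_daily = df_expand.sort_values(['type', 'day']).set_index('day')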