Performance issue turning rows with start/end dates into a dataframe with a time index


I have a large dataset where each line represents the value of a certain type (think a sensor) for a time interval (between start and end). It looks like this:

    start       end    type value
2015-01-01  2015-01-05  1   3
2015-01-06  2015-01-08  1   2
2015-01-05  2015-01-08  3   3
2015-01-13  2015-01-16  2   1

I want to turn it into a daily time-indexed frame like this:

day       type  value
2015-01-01  1   3
2015-01-02  1   3
2015-01-03  1   3
2015-01-04  1   3
2015-01-05  1   3
2015-01-06  1   2
2015-01-07  1   2
2015-01-08  1   2
2015-01-05  3   3
2015-01-06  3   3
2015-01-07  3   3
2015-01-08  3   3
2015-01-13  2   1
2015-01-14  2   1
2015-01-15  2   1
2015-01-16  2   1

(Note that we cannot make any assumptions about the intervals: they should be contiguous and non-overlapping, but we cannot guarantee that.)

Based on these Stack Overflow answers, [1] (DataFrame resample on date ranges) and [2] (pandas: Aggregate based on start/end date), there seem to be two methods: one built around itertuples and one built around melt ([2] used stack/unstack, but that is similar to melt). Let us compare them for performance.

import datetime as dt
import numpy as np
import pandas as pd

# Creating a big enough dataframe
date_range = pd.date_range(start=dt.datetime(2015, 1, 1), end=dt.datetime(2019, 12, 31), freq='4D')
to_concat = []
for val in range(1, 50):
    frame_tmp = pd.DataFrame()
    frame_tmp['start'] = date_range
    frame_tmp['end'] = frame_tmp['start'] + dt.timedelta(3)
    frame_tmp['type'] = val
    frame_tmp['value'] = np.random.randint(1, 6, frame_tmp.shape[0])
    to_concat.append(frame_tmp)
df = pd.concat(to_concat, ignore_index=True)

# Method 1
def method_1(df):
    df1 = (pd.concat([pd.Series(r.Index, pd.date_range(r.start, r.end, freq='D'))
                      for r in df.itertuples()])
           .reset_index())
    df1.columns = ['start_2', 'idx']
    df2 = df1.set_index('idx').join(df).reset_index(drop=True)
    return df2.set_index('start_2')

df_method_1 = df.groupby(['type']).apply(method_1)

# Method 2
df_tmp = df.reset_index()
df1 = df_tmp.melt(df_tmp.columns.difference(['start', 'end']),
                  ['start', 'end'],
                  value_name='current_time')
df_method_2 = (df1.set_index('current_time')
                  .groupby('index', group_keys=False)
                  .resample('D').ffill())

With %%timeit in Jupyter, method 1 takes ~8 s and method 2 takes ~25 s on the example dataframe defined above. This is far too slow, as the real dataset I am dealing with is much bigger. On that dataframe, method 1 takes ~20 minutes.
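
For reference, a minimal timing harness outside of Jupyter might look like the sketch below (my own addition, not part of the original question); it assumes the df and method_1 defined above and uses the standard-library timeit module in place of the %%timeit cell magic:

import timeit

# One run already takes several seconds, so a single repetition is enough here.
t1 = timeit.timeit(lambda: df.groupby(['type']).apply(method_1), number=1)
print(f"method 1: {t1:.1f} s")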

Do you have any idea on how to make this faster?

Answer

This is about 1.7 times faster than your method_1 and a little tidier:

df_expand = pd.DataFrame.from_records(
    ((d, r.type, r.value)
     for r in df.itertuples()
     for d in pd.date_range(start=r.start, end=r.end, freq='D')),
    columns=['day', 'type', 'value']
)

You can get about 7 times faster by creating your own date range instead of calling pd.date_range():

one_day = dt.timedelta(1)
df_expand = pd.DataFrame.from_records(
    ((r.start + i * one_day, r.type, r.value)
     for r in df.itertuples()
     for i in range(int((r.end - r.start) / one_day) + 1)),
    columns=['day', 'type', 'value']
)

Or you can get up to 24 times faster by using numpy's arange function to generate the dates:

one_day = dt.timedelta(1)
df_expand = pd.DataFrame.from_records(
    ((d, r.type, r.value)
     for r in df.itertuples()
     for d in np.arange(r.start.date(), r.end.date() + one_day, dtype='datetime64[D]')),
    columns=['day', 'type', 'value']
)

I couldn't resist adding one more that is a little more than twice as fast as the last one. Unfortunately, it's a lot harder to read. This groups the readings based on how many days they span ('dur'), then uses vectorized numpy operations to expand each group in a single batch.

def expand_group(g):
    dur = g.dur.iloc[0]  # how many days for each reading in this group?
    return pd.DataFrame({
        'day': (g.start.values[:, None] + np.timedelta64(1, 'D') * np.arange(dur)).ravel(),
        'type': np.repeat(g.type.values, dur),
        'value': np.repeat(g.value.values, dur),
    })

# take all readings with the same duration and process them together using vectorized code
df_expand = (df.assign(dur=(df['end'] - df['start']).dt.days + 1)
               .groupby('dur')
               .apply(expand_group)
               .reset_index('dur', drop=True))

Update: Responding to your comment, below is a simplified version of the vectorized approach, which is faster and easier to read. Instead of using the groupby step, this makes a single matrix as wide as the longest reading, then filters out the unneeded entries. This should be pretty efficient unless the maximum duration for your readings is much longer than the average. With the test dataframe (all readings lasting 4 days), this is about 15x faster than the groupby solution and about 700x faster than method_1.

dur = (df['end'] - df['start']).max().days + 1

# Build one matrix as wide as the longest reading, then drop the days past each row's end.
df_expand = pd.DataFrame({
    'day': (df['start'].values[:, None] + np.timedelta64(1, 'D') * np.arange(dur)).ravel(),
    'type': np.repeat(df['type'].values, dur),
    'value': np.repeat(df['value'].values, dur),
    'end': np.repeat(df['end'].values, dur),
})
df_expand = df_expand.loc[df_expand['day'] <= df_expand['end'], 'day':'value']
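
As a quick sanity check (my own addition, not part of the original answer), each reading should expand to exactly one row per day it covers, and setting 'day' as the index reproduces the time-indexed layout asked for in the question:

# A reading spanning n days should contribute exactly n rows to df_expand.
expected_rows = ((df['end'] - df['start']).dt.days + 1).sum()
assert len(df_expand) == expected_rows

# Day-indexed view matching the desired output layout.
df_daily = df_expand.set_index('day').sort_index()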
