Question 1

I have two timeseries dataframes df1 and df2:

df1 = pd.DataFrame({'date_1':['10/11/2017 0:00','10/11/2017 03:00','10/11/2017 06:00','10/11/2017 09:00'],'value_1':[5000,1500,np.nan,2000]})df1['date_1'] = pd.to_datetime(df1.date_1.astype(str), format='%m/%d/%Y %H:%M',errors ='coerce') 
df1.index = pd.DatetimeIndex(df1.date_1)
df1.drop('date_1', axis = 1, inplace = True)

&

df2 = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00','2017-10-11 00:50:00', '2017-10-11 01:20:00','2017-10-11 01:40:00','2017-10-11 02:20:00','2017-10-11 02:50:00', '2017-10-11 03:00:00','2017-10-11 03:20:00', '2017-10-11 03:50:00','2017-10-11 04:20:00', '2017-10-11 04:50:00','2017-10-11 05:20:00', '2017-10-11 05:50:00','2017-10-11 06:00:00', '2017-10-11 06:20:00','2017-10-11 06:50:00', '2017-10-11 07:20:00','2017-10-11 07:50:00', '2017-10-11 08:20:00','2017-10-11 08:50:00', '2017-10-11 09:20:00','2017-10-11 09:50:00', '2017-10-11 10:20:00'],'value_2':[1500.0, 2050.0,  np.nan,  2400.0, 2500.0,  2550.0,  2900.0,  np.nan,3200.0,  3500.0,  np.nan,  3600.0,2600.0,  2500.0,  2350.0,  2200.0,np.nan,  2100.0,  np.nan,  2400.0,2600.0,  np.nan,  8000.0,  9000.0]})
df2['date_2'] = pd.to_datetime(df2.date_2.astype(str), format='%Y-%m-%d %H:%M',errors ='coerce') 
df2.index = pd.DatetimeIndex(df2.date_2)
df2.drop('date_2', axis = 1, inplace = True)

Both dataframes are observations on the same day but with different time resolution. df1 has time resolution of 3 hours whereas df2 has time resolution of 30 minutes or less. I am interested to create a new dataframe dfx by comparing above dataframes with certain conditions, and create two columns count and duration in dfx.

firstly: look at df_2['value_2']
compare df_2['value_2'] with df_1['value_1']
if df_2['value_2']<2800 for a timestamp & df_1['value_1'] >1600 for a timestamp within nearest half of the resolution of df1 i.e. 01:30 we count the event as 1 otherwise 0.
e.g. for a timestamps of df2 00:00:00 - 01:30:00 compare df_2['value_2'] values with
df_1['value_1'] at 00:00:00
for a timestamps of df2 01:31:00 - 03:00:00 compare df_2['value_2'] values with
df_1['value_1'] at 03:00:00
for a timestamps of df2 03:00:00 - 04:30:00 compare df_2['value_2'] values with
df_1['value_1'] at 03:00:00
for a timestamps of df2 04:31:00 - 06:00:00 compare df_2['value_2'] values with
df_1['value_1'] at 06:00:00 and so on. where,
if df2['value_2] == np.nan for a timestamp t replace the nan value with average of values at timestampst-1 & t+1 and then make the comparison.
if df1['value_1] == np.nan for a timestamp t , give the corresponding count value 0.

For the duration column in dfx: dfx['duration] = df2.index[i+1] - df2.index[i] for count on marginal time stamps like 01:20:00, dfx['duration] = (df1.index[i] + 01:30) - df2.index[i] where. df1.index[i] is the timestamp of df1 with which comparison of df2 is made.

Desired output

dfx = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00','2017-10-11 00:50:00', '2017-10-11 01:20:00','2017-10-11 01:40:00','2017-10-11 02:20:00','2017-10-11 02:50:00', '2017-10-11 03:00:00','2017-10-11 03:20:00', '2017-10-11 03:50:00','2017-10-11 04:20:00', '2017-10-11 04:50:00','2017-10-11 05:20:00', '2017-10-11 05:50:00','2017-10-11 06:00:00', '2017-10-11 06:20:00','2017-10-11 06:50:00', '2017-10-11 07:20:00','2017-10-11 07:50:00', '2017-10-11 08:20:00','2017-10-11 08:50:00', '2017-10-11 09:20:00','2017-10-11 09:50:00', '2017-10-11 10:20:00'],'count':[1, 1,  1,  1, 0,  0,  0, 0,0,  0,  0,  0,0,  0,  0,  0,0,  0,  1,  1,1,  0,  0,  0],'duration':['00:30','00:20','00:30','00:10','00:00', '00:00', '00:00', '00:00','00:00', '00:00', '00:00', '00:00','00:00', '00:00', '00:00', '00:00','00:00', '00:00', '00:30', '00:30','00:10', '00:00', '00:00', '00:00']})dfx['date_2'] = pd.to_datetime(dfx.date_2.astype(str), format='%Y-%m-%d %H:%M',errors ='coerce') 
dfx.index = pd.DatetimeIndex(dfx.date_2)
dfx.drop('date_2', axis = 1, inplace = True)

My question has become quite long in spite of my desire to shorten it. Please, bear with it. I would highly appreciate your kind help.

Thanks!

Question 2

Input data:

>>> df1value_1
date_1
2017-10-11 00:00:00   5000.0
2017-10-11 03:00:00   1500.0
2017-10-11 06:00:00   1200.0
2017-10-11 09:00:00      NaN>>> df2value_2
date_2
2017-10-11 00:00:00   1500.0
2017-10-11 00:30:00   2050.0
2017-10-11 00:50:00      NaN
2017-10-11 01:20:00   2400.0
2017-10-11 01:40:00   2500.0
...
2017-10-11 08:20:00   2400.0
2017-10-11 08:50:00   2600.0
2017-10-11 09:20:00      NaN
2017-10-11 09:50:00   8000.0
2017-10-11 10:20:00   9000.0

Fill NaN value from df2 by linear interpolation between t-1 and t+1:

df2['value_2'] = df2['value_2'].interpolate()

Create an interval from df1 according to your rules:

ii = pd.IntervalIndex.from_tuples(list(zip(df1.index - pd.DateOffset(hours=1, minutes=29),df1.index + pd.DateOffset(hours=1, minutes=30))))

Bin values into discrete intervals:

df1['interval'] = pd.cut(df1.index, bins=ii)
df2['interval'] = pd.cut(df2.index, bins=ii)

Merge the two dataframes on interval:

dfx = pd.merge(df2, df1, on='interval', how='left').set_index('interval')
dfx = (dfx['value_2'].lt(2800) & dfx['value_1'].gt(1600)) \.astype(int).to_frame('count').set_index(df2.index)

Append index of df1 with as a freq of 90 minutes:

dti = df2.index.append(pd.DatetimeIndex(df1.index.to_series().resample('90T').groups.keys())).sort_values().drop_duplicates()
dfx = dfx.reindex(dti).ffill().astype(int)

Compute duration from count and reindex from df2:

dfx['duration'] = dfx.index.to_series().diff(-1).abs() \.fillna(pd.Timedelta(0)).dt.components \.apply(lambda x: f"{x['hours']:02}:{x['minutes']:02}",axis='columns')dfx.loc[dfx['count'] == 0, 'duration'] = '00:00'
dfx = dfx.reindex(df2.index)

Output result:

>>> dfxcount duration
date_2
2017-10-11 00:00:00      1    00:30
2017-10-11 00:30:00      1    00:20
2017-10-11 00:50:00      1    00:30
2017-10-11 01:20:00      1    00:10
2017-10-11 01:40:00      0    00:00
2017-10-11 02:20:00      0    00:00
2017-10-11 02:50:00      0    00:00
2017-10-11 03:00:00      0    00:00
2017-10-11 03:20:00      0    00:00
2017-10-11 03:50:00      0    00:00
2017-10-11 04:20:00      0    00:00
2017-10-11 04:50:00      0    00:00
2017-10-11 05:20:00      0    00:00
2017-10-11 05:50:00      0    00:00
2017-10-11 06:00:00      0    00:00
2017-10-11 06:20:00      0    00:00
2017-10-11 06:50:00      0    00:00
2017-10-11 07:20:00      0    00:00
2017-10-11 07:50:00      1    00:30
2017-10-11 08:20:00      1    00:30
2017-10-11 08:50:00      1    00:10
2017-10-11 09:20:00      0    00:00
2017-10-11 09:50:00      0    00:00
2017-10-11 10:20:00      0    00:00

comparing two timeseries dataframes based on some conditions in pandas

Related Q&A

Game of Chance in Python 3.x?

Count occurence of a word by ID in python

ModuleNotFoundError: No module named plyer in Python

Floating point to 16 bit Twos Complement Binary, Python

Flag the first non zero column value with 1 and rest 0 having multiple columns

How to split up data from a column in a csv file into two separate output csv files?

Discord.py spellcheck commands

Django Model Form doesnt seem to validate the BooleanField

For loop only shows the first object

IndexError: pop from empty list