I have two timeseries dataframes, df1 and df2:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date_1': ['10/11/2017 0:00', '10/11/2017 03:00', '10/11/2017 06:00', '10/11/2017 09:00'],
                    'value_1': [5000, 1500, np.nan, 2000]})
df1['date_1'] = pd.to_datetime(df1.date_1.astype(str), format='%m/%d/%Y %H:%M', errors='coerce')
df1.index = pd.DatetimeIndex(df1.date_1)
df1.drop('date_1', axis=1, inplace=True)
and
df2 = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00', '2017-10-11 00:50:00', '2017-10-11 01:20:00',
                               '2017-10-11 01:40:00', '2017-10-11 02:20:00', '2017-10-11 02:50:00', '2017-10-11 03:00:00',
                               '2017-10-11 03:20:00', '2017-10-11 03:50:00', '2017-10-11 04:20:00', '2017-10-11 04:50:00',
                               '2017-10-11 05:20:00', '2017-10-11 05:50:00', '2017-10-11 06:00:00', '2017-10-11 06:20:00',
                               '2017-10-11 06:50:00', '2017-10-11 07:20:00', '2017-10-11 07:50:00', '2017-10-11 08:20:00',
                               '2017-10-11 08:50:00', '2017-10-11 09:20:00', '2017-10-11 09:50:00', '2017-10-11 10:20:00'],
                    'value_2': [1500.0, 2050.0, np.nan, 2400.0, 2500.0, 2550.0, 2900.0, np.nan,
                                3200.0, 3500.0, np.nan, 3600.0, 2600.0, 2500.0, 2350.0, 2200.0,
                                np.nan, 2100.0, np.nan, 2400.0, 2600.0, np.nan, 8000.0, 9000.0]})
# The strings include seconds, so the format needs %S; otherwise errors='coerce' can turn every value into NaT.
df2['date_2'] = pd.to_datetime(df2.date_2.astype(str), format='%Y-%m-%d %H:%M:%S', errors='coerce')
df2.index = pd.DatetimeIndex(df2.date_2)
df2.drop('date_2', axis=1, inplace=True)
Both dataframes hold observations from the same day but at different time resolutions: df1 has a time resolution of 3 hours, whereas df2 has a time resolution of 30 minutes or less.
I would like to create a new dataframe dfx by comparing the two dataframes above under certain conditions, with two columns in dfx: count and duration.
- First, look at df2['value_2'].
- Compare df2['value_2'] with df1['value_1'].
- If df2['value_2'] < 2800 for a timestamp and df1['value_1'] > 1600 for the df1 timestamp that lies within half the resolution of df1 (i.e. 01:30), we count the event as 1, otherwise 0.
  - e.g. for df2 timestamps 00:00:00 - 01:30:00, compare df2['value_2'] values with df1['value_1'] at 00:00:00
  - for df2 timestamps 01:31:00 - 03:00:00, compare df2['value_2'] values with df1['value_1'] at 03:00:00
  - for df2 timestamps 03:00:00 - 04:30:00, compare df2['value_2'] values with df1['value_1'] at 03:00:00
  - for df2 timestamps 04:31:00 - 06:00:00, compare df2['value_2'] values with df1['value_1'] at 06:00:00

  and so on.

where:
- if df2['value_2'] is NaN for a timestamp t, replace the NaN value with the average of the values at timestamps t-1 and t+1, and then make the comparison.
- if df1['value_1'] is NaN for a timestamp t, give the corresponding count value 0.
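To make the counting rules concrete, here is a minimal sketch of one way they could be read, not a definitive solution. It rebuilds the two frames in a compact form, fills each NaN in value_2 with the mean of its two neighbouring rows, and matches each df2 timestamp to the nearest df1 timestamp within 01:30. The names day, v2_filled and v1_matched are only illustrative. One assumption to be aware of: pandas documents that nearest matching breaks ties toward the larger index value, so a df2 timestamp at exactly 01:30 would be matched to 03:00 rather than 00:00; the sample data contains no such tie.

```python
import numpy as np
import pandas as pd

day = '2017-10-11 '  # illustrative helper, not part of the original data
df1 = pd.DataFrame(
    {'value_1': [5000, 1500, np.nan, 2000]},
    index=pd.to_datetime([day + t for t in ['00:00', '03:00', '06:00', '09:00']]),
)
times_2 = ['00:00', '00:30', '00:50', '01:20', '01:40', '02:20', '02:50', '03:00',
           '03:20', '03:50', '04:20', '04:50', '05:20', '05:50', '06:00', '06:20',
           '06:50', '07:20', '07:50', '08:20', '08:50', '09:20', '09:50', '10:20']
values_2 = [1500, 2050, np.nan, 2400, 2500, 2550, 2900, np.nan,
            3200, 3500, np.nan, 3600, 2600, 2500, 2350, 2200,
            np.nan, 2100, np.nan, 2400, 2600, np.nan, 8000, 9000]
df2 = pd.DataFrame({'value_2': values_2},
                   index=pd.to_datetime([day + t for t in times_2]))

# Replace each NaN in value_2 with the average of the rows at t-1 and t+1.
v2 = df2['value_2']
v2_filled = v2.fillna((v2.shift(1) + v2.shift(-1)) / 2)

# For every df2 timestamp, take value_1 from the nearest df1 timestamp,
# but only if it lies within half of df1's 3-hour resolution.
v1_matched = df1['value_1'].reindex(
    df2.index, method='nearest', tolerance=pd.Timedelta(minutes=90)
)

# Comparisons against NaN are False, so a NaN in value_1 gives count 0.
count = ((v2_filled < 2800) & (v1_matched > 1600)).astype(int)
dfx = pd.DataFrame({'count': count})
print(dfx)
```

On the sample data this gives 1 for the first four rows and for 07:50, 08:20 and 08:50, and 0 everywhere else.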
For the duration column in dfx:
dfx['duration'] = df2.index[i+1] - df2.index[i]
For a count on marginal timestamps like 01:20:00,
dfx['duration'] = (df1.index[i] + 01:30) - df2.index[i]
where df1.index[i] is the timestamp of df1 with which the comparison of df2 is made.
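The two duration formulas can be combined into a single rule: take the gap to the next df2 timestamp, but cap it at the end of the matched df1 window (df1 timestamp + 01:30), and use zero where count is 0. The sketch below shows this on the first five df2 rows only, with the counts typed in by hand; idx1, idx2, half and the other names are hypothetical. This is a literal reading of the formulas and is an assumption about what "marginal" means: applied to the 08:50 row it would give 00:30, whereas the desired output lists 00:10, so the intended rule may differ there.

```python
import pandas as pd

day = '2017-10-11 '
idx1 = pd.to_datetime([day + t for t in ['00:00', '03:00']])   # df1 timestamps
idx2 = pd.to_datetime([day + t for t in ['00:00', '00:30', '00:50', '01:20', '01:40']])
count = pd.Series([1, 1, 1, 1, 0], index=idx2)  # counts for these rows, entered by hand

half = pd.Timedelta(minutes=90)  # half of df1's 3-hour resolution

# df1 timestamp that each df2 row is compared with (nearest match).
matched = idx1[idx1.get_indexer(idx2, method='nearest')]

ts = idx2.to_series()
to_next = ts.shift(-1) - ts                                         # df2.index[i+1] - df2.index[i]
to_window_end = pd.Series((matched + half).to_numpy(), index=idx2) - ts  # (df1 ts + 01:30) - df2 ts

# Use the gap to the next row unless the matched window ends sooner
# (the last row has no next row, so it falls back to the window end).
capped = to_next.where(to_next <= to_window_end, to_window_end)

# Rows that are not counted get a zero duration.
duration = capped.where(count == 1, pd.Timedelta(0))
print(duration)
```

For these five rows the result is 30, 20, 30, 10 and 0 minutes, which agrees with the first five entries of the desired output.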
Desired output
dfx = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00', '2017-10-11 00:50:00', '2017-10-11 01:20:00',
                               '2017-10-11 01:40:00', '2017-10-11 02:20:00', '2017-10-11 02:50:00', '2017-10-11 03:00:00',
                               '2017-10-11 03:20:00', '2017-10-11 03:50:00', '2017-10-11 04:20:00', '2017-10-11 04:50:00',
                               '2017-10-11 05:20:00', '2017-10-11 05:50:00', '2017-10-11 06:00:00', '2017-10-11 06:20:00',
                               '2017-10-11 06:50:00', '2017-10-11 07:20:00', '2017-10-11 07:50:00', '2017-10-11 08:20:00',
                               '2017-10-11 08:50:00', '2017-10-11 09:20:00', '2017-10-11 09:50:00', '2017-10-11 10:20:00'],
                    'count': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
                    'duration': ['00:30', '00:20', '00:30', '00:10', '00:00', '00:00', '00:00', '00:00',
                                 '00:00', '00:00', '00:00', '00:00', '00:00', '00:00', '00:00', '00:00',
                                 '00:00', '00:00', '00:30', '00:30', '00:10', '00:00', '00:00', '00:00']})
# The strings include seconds, so the format needs %S.
dfx['date_2'] = pd.to_datetime(dfx.date_2.astype(str), format='%Y-%m-%d %H:%M:%S', errors='coerce')
dfx.index = pd.DatetimeIndex(dfx.date_2)
dfx.drop('date_2', axis=1, inplace=True)
My question has become quite long in spite of my desire to shorten it, so please bear with me. I would highly appreciate your kind help.
Thanks!