comparing two timeseries dataframes based on some conditions in pandas

2024/11/14 12:52:52

I have two timeseries dataframes df1 and df2:

df1 = pd.DataFrame({'date_1':['10/11/2017 0:00','10/11/2017 03:00','10/11/2017 06:00','10/11/2017 09:00'],'value_1':[5000,1500,np.nan,2000]})df1['date_1'] = pd.to_datetime(df1.date_1.astype(str), format='%m/%d/%Y %H:%M',errors ='coerce') 
df1.index = pd.DatetimeIndex(df1.date_1)
df1.drop('date_1', axis = 1, inplace = True)

&

df2 = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00','2017-10-11 00:50:00', '2017-10-11 01:20:00','2017-10-11 01:40:00','2017-10-11 02:20:00','2017-10-11 02:50:00', '2017-10-11 03:00:00','2017-10-11 03:20:00', '2017-10-11 03:50:00','2017-10-11 04:20:00', '2017-10-11 04:50:00','2017-10-11 05:20:00', '2017-10-11 05:50:00','2017-10-11 06:00:00', '2017-10-11 06:20:00','2017-10-11 06:50:00', '2017-10-11 07:20:00','2017-10-11 07:50:00', '2017-10-11 08:20:00','2017-10-11 08:50:00', '2017-10-11 09:20:00','2017-10-11 09:50:00', '2017-10-11 10:20:00'],'value_2':[1500.0, 2050.0,  np.nan,  2400.0, 2500.0,  2550.0,  2900.0,  np.nan,3200.0,  3500.0,  np.nan,  3600.0,2600.0,  2500.0,  2350.0,  2200.0,np.nan,  2100.0,  np.nan,  2400.0,2600.0,  np.nan,  8000.0,  9000.0]})
df2['date_2'] = pd.to_datetime(df2.date_2.astype(str), format='%Y-%m-%d %H:%M',errors ='coerce') 
df2.index = pd.DatetimeIndex(df2.date_2)
df2.drop('date_2', axis = 1, inplace = True)

Both dataframes are observations on the same day but with different time resolution. df1 has time resolution of 3 hours whereas df2 has time resolution of 30 minutes or less. I am interested to create a new dataframe dfx by comparing above dataframes with certain conditions, and create two columns count and duration in dfx.

  • firstly: look at df_2['value_2']
  • compare df_2['value_2'] with df_1['value_1']
  • if df_2['value_2']<2800 for a timestamp & df_1['value_1'] >1600 for a timestamp within nearest half of the resolution of df1 i.e. 01:30 we count the event as 1 otherwise 0.
  • e.g. for a timestamps of df2 00:00:00 - 01:30:00 compare df_2['value_2'] values with
    df_1['value_1'] at 00:00:00
  • for a timestamps of df2 01:31:00 - 03:00:00 compare df_2['value_2'] values with
    df_1['value_1'] at 03:00:00
  • for a timestamps of df2 03:00:00 - 04:30:00 compare df_2['value_2'] values with
    df_1['value_1'] at 03:00:00
  • for a timestamps of df2 04:31:00 - 06:00:00 compare df_2['value_2'] values with
    df_1['value_1'] at 06:00:00 and so on. where,
  • if df2['value_2] == np.nan for a timestamp t replace the nan value with average of values at timestampst-1 & t+1 and then make the comparison.
  • if df1['value_1] == np.nan for a timestamp t , give the corresponding count value 0.

For the duration column in dfx: dfx['duration] = df2.index[i+1] - df2.index[i] for count on marginal time stamps like 01:20:00, dfx['duration] = (df1.index[i] + 01:30) - df2.index[i] where. df1.index[i] is the timestamp of df1 with which comparison of df2 is made.

Desired output

dfx = pd.DataFrame({'date_2': ['2017-10-11 00:00:00', '2017-10-11 00:30:00','2017-10-11 00:50:00', '2017-10-11 01:20:00','2017-10-11 01:40:00','2017-10-11 02:20:00','2017-10-11 02:50:00', '2017-10-11 03:00:00','2017-10-11 03:20:00', '2017-10-11 03:50:00','2017-10-11 04:20:00', '2017-10-11 04:50:00','2017-10-11 05:20:00', '2017-10-11 05:50:00','2017-10-11 06:00:00', '2017-10-11 06:20:00','2017-10-11 06:50:00', '2017-10-11 07:20:00','2017-10-11 07:50:00', '2017-10-11 08:20:00','2017-10-11 08:50:00', '2017-10-11 09:20:00','2017-10-11 09:50:00', '2017-10-11 10:20:00'],'count':[1, 1,  1,  1, 0,  0,  0, 0,0,  0,  0,  0,0,  0,  0,  0,0,  0,  1,  1,1,  0,  0,  0],'duration':['00:30','00:20','00:30','00:10','00:00', '00:00', '00:00', '00:00','00:00', '00:00', '00:00', '00:00','00:00', '00:00', '00:00', '00:00','00:00', '00:00', '00:30', '00:30','00:10', '00:00', '00:00', '00:00']})dfx['date_2'] = pd.to_datetime(dfx.date_2.astype(str), format='%Y-%m-%d %H:%M',errors ='coerce') 
dfx.index = pd.DatetimeIndex(dfx.date_2)
dfx.drop('date_2', axis = 1, inplace = True)

My question has become quite long in spite of my desire to shorten it. Please, bear with it. I would highly appreciate your kind help.

Thanks!

Answer

Input data:

>>> df1value_1
date_1
2017-10-11 00:00:00   5000.0
2017-10-11 03:00:00   1500.0
2017-10-11 06:00:00   1200.0
2017-10-11 09:00:00      NaN>>> df2value_2
date_2
2017-10-11 00:00:00   1500.0
2017-10-11 00:30:00   2050.0
2017-10-11 00:50:00      NaN
2017-10-11 01:20:00   2400.0
2017-10-11 01:40:00   2500.0
...
2017-10-11 08:20:00   2400.0
2017-10-11 08:50:00   2600.0
2017-10-11 09:20:00      NaN
2017-10-11 09:50:00   8000.0
2017-10-11 10:20:00   9000.0
  1. Fill NaN value from df2 by linear interpolation between t-1 and t+1:
df2['value_2'] = df2['value_2'].interpolate()
  1. Create an interval from df1 according to your rules:
ii = pd.IntervalIndex.from_tuples(list(zip(df1.index - pd.DateOffset(hours=1, minutes=29),df1.index + pd.DateOffset(hours=1, minutes=30))))
  1. Bin values into discrete intervals:
df1['interval'] = pd.cut(df1.index, bins=ii)
df2['interval'] = pd.cut(df2.index, bins=ii)
  1. Merge the two dataframes on interval:
dfx = pd.merge(df2, df1, on='interval', how='left').set_index('interval')
dfx = (dfx['value_2'].lt(2800) & dfx['value_1'].gt(1600)) \.astype(int).to_frame('count').set_index(df2.index)
  1. Append index of df1 with as a freq of 90 minutes:
dti = df2.index.append(pd.DatetimeIndex(df1.index.to_series().resample('90T').groups.keys())).sort_values().drop_duplicates()
dfx = dfx.reindex(dti).ffill().astype(int)
  1. Compute duration from count and reindex from df2:
dfx['duration'] = dfx.index.to_series().diff(-1).abs() \.fillna(pd.Timedelta(0)).dt.components \.apply(lambda x: f"{x['hours']:02}:{x['minutes']:02}",axis='columns')dfx.loc[dfx['count'] == 0, 'duration'] = '00:00'
dfx = dfx.reindex(df2.index)

Output result:

>>> dfxcount duration
date_2
2017-10-11 00:00:00      1    00:30
2017-10-11 00:30:00      1    00:20
2017-10-11 00:50:00      1    00:30
2017-10-11 01:20:00      1    00:10
2017-10-11 01:40:00      0    00:00
2017-10-11 02:20:00      0    00:00
2017-10-11 02:50:00      0    00:00
2017-10-11 03:00:00      0    00:00
2017-10-11 03:20:00      0    00:00
2017-10-11 03:50:00      0    00:00
2017-10-11 04:20:00      0    00:00
2017-10-11 04:50:00      0    00:00
2017-10-11 05:20:00      0    00:00
2017-10-11 05:50:00      0    00:00
2017-10-11 06:00:00      0    00:00
2017-10-11 06:20:00      0    00:00
2017-10-11 06:50:00      0    00:00
2017-10-11 07:20:00      0    00:00
2017-10-11 07:50:00      1    00:30
2017-10-11 08:20:00      1    00:30
2017-10-11 08:50:00      1    00:10
2017-10-11 09:20:00      0    00:00
2017-10-11 09:50:00      0    00:00
2017-10-11 10:20:00      0    00:00
https://en.xdnf.cn/q/119715.html

Related Q&A

Game of Chance in Python 3.x?

I have this problem in my python code which is a coinflip game, the problem is that when It asks, "Heads or Tails?" and I just say 1 or Heads(same for 2 and Tails) without quotation marks an…

Count occurence of a word by ID in python

Following is the content of a file,My question is how to count the number of occurences for the word "optimus" for different IDs ID67 DATEUID Thank you for choosing Optimus prime. Please w…

ModuleNotFoundError: No module named plyer in Python

I am trying to write a program notify.py (location: desktop) that uses plyer library to get a notification on windows 10. I used pip install plyer and am using vs code to run the program but I get an e…

Floating point to 16 bit Twos Complement Binary, Python

so I think questions like this have been asked before but Im having quite a bit of trouble getting this implemented. Im dealing with CSV files that contain floating points between -1 and 1. All of thes…

Flag the first non zero column value with 1 and rest 0 having multiple columns

Please assist with the belowimport pandas as pd df = pd.DataFrame({Grp: [1,1,1,1,2,2,2,2,3,3,3,4,4,4], Org1: [x,x,y,y,z,y,z,z,x,y,y,z,x,x], Org2: [a,a,b,b,c,b,c,c,a,b,b,c,a,a], Value: [0,0,3,1,0,1,0,5,…

How to split up data from a column in a csv file into two separate output csv files?

I have a .csv file, e.g.:ID NAME CATEGORIES 1, x, AB 2, xx, AA 3, xxx, BAHow would I get this to form two output .csv files based on the category e.g.:File 1:ID NAME CATEGORY 1, x, A 2, xx, A 3, …

Discord.py spellcheck commands

Recently, I looked up Stack Overflow and found this code which can check for potential typos: from difflib import SequenceMatcher SequenceMatcher(None, "help", "hepl").ratio() # Ret…

Django Model Form doesnt seem to validate the BooleanField

In my model the validation is not validating for the boolean field, only one time product_field need to be checked , if two time checked raise validation error.product_field = models.BooleanField(defau…

For loop only shows the first object

I have a code that loops through a list of mails, but it is only showing the first result, even though there are also other matches. The other results require me to loop over the mails again only to re…

IndexError: pop from empty list

I need help. I have no idea why I am getting this error. The error is in fname = 1st.pop()for i in range(num) :fname = lst.pop()lTransfer = [(os.path.join(src, fname), os.path.join(dst, fna…