Groupby Pandas , calculate multiple columns based on date difference

2024/11/19 21:38:33

I have a pandas dataframe shown below:

CID RefID   Date        Group   MID 
100     1   1/01/2021       A                       
100     2   3/01/2021       A                       
100     3   4/01/2021       A   101             
100     4   15/01/2021      A                           
100     5   18/01/2021      A                   
200     6   3/03/2021       B                       
200     7   4/04/2021       B                       
200     8   9/04/2021       B   102             
200     9   25/04/2021      B                       
300     10  26/04/2021      C                       
300     11  27/05/2021      C           
300     12  28/05/2021      C   103 

I want to create three columns:

days_diff:

  1. This has to be created in a way that if the difference b/w the first Date and corresponding rows is greater than 30 belonging to the same CID then assign 'NAT' or 0 to the next row (reset) and then subtract the date with this row for the following values

  2. If MIDis not null and belong to same CID group assign 'NAT' or 0 to the next row (reset) and then subtract the date with this row for the following values

Otherwise just fetch the date difference b/w the first row belonging to the same CID for the corresponding rows

A: This depends on the days_diff column , this column is like a counter it will only change/increment when there's another NAT occurrence for the same CID and reset itself for every CID.

B: This column depends on the column A , if the value in A remains same it won't change otherwise increments

It's a bit complicated to explain please refer to the output below for reference. I have used .groupby() .diff() and .shift() methods to create multiple dummy columns in order to calculate this and still working on it, please let me know the best way to go about this, thanks

My expected output :

CID RefID   Date        Group   MID     days_diff       A   B
100     1   1/01/2021       A           NAT             1   1
100     2   3/01/2021       A           2 days          1   1
100     3   4/01/2021       A   101     3 days          1   1
100     4   15/01/2021      A           NAT             2   4
100     5   18/01/2021      A           3 days          2   4
200     6   3/03/2021       B           NAT             1   6
200     7   4/04/2021       B           NAT             2   7
200     8   9/04/2021       B   102     5 days          2   7
200     9   25/04/2021      B           NAT             3   9
300     10  26/04/2021      C           NAT             1   10
300     11  27/05/2021      C           NAT             2   11
300     12  28/05/2021      C   103     1 day           2   11
Answer

You could do something like this:

def days_diff(sdf):result = pd.DataFrame({"days_diff": pd.NaT, "A": None}, index=sdf.index)start = sdf.at[sdf.index[0], "Date"]for index, day, next_MID_is_na in zip(sdf.index[1:], sdf.Date[1:], sdf.MID.shift(1).isna()[1:]):diff = (day - start).daysif diff <= 30 and next_MID_is_na:result.at[index, "days_diff"] = diffelse:start = dayresult.A = result.days_diff.isna().cumsum()return resultdf[["days_diff", "A"]] = df[["CID", "Date", "MID"]].groupby("CID").apply(days_diff)
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()

Result for df created by

from io import StringIO
data = StringIO(
'''
CID RefID   Date        Group   MID 
100     1   1/01/2021       A                       
100     2   3/01/2021       A                       
100     3   4/01/2021       A   101             
100     4   15/01/2021      A                           
100     5   18/01/2021      A                   
200     6   3/03/2021       B                       
200     7   4/04/2021       B                       
200     8   9/04/2021       B   102             
200     9   25/04/2021      B                       
300     10  26/04/2021      C                       
300     11  27/05/2021      C           
300     12  28/05/2021      C   103
''')
df = pd.read_csv(data, delim_whitespace=True)
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")

is

    CID  RefID       Date Group    MID days_diff  A     B
0   100      1 2021-01-01     A    NaN       NaT  1   1.0
1   100      2 2021-01-03     A    NaN         2  1   1.0
2   100      3 2021-01-04     A  101.0         3  1   1.0
3   100      4 2021-01-15     A    NaN       NaT  2   4.0
4   100      5 2021-01-18     A    NaN         3  2   4.0
5   200      6 2021-03-03     B    NaN       NaT  1   6.0
6   200      7 2021-04-04     B    NaN       NaT  2   7.0
7   200      8 2021-04-09     B  102.0         5  2   7.0
8   200      9 2021-04-25     B    NaN       NaT  3   9.0
9   300     10 2021-04-26     C    NaN       NaT  1  10.0
10  300     11 2021-05-27     C    NaN       NaT  2  11.0
11  300     12 2021-05-28     C  103.0         1  2  11.0

A few explanations:

  • The function days_diff produces a dataframe with the two columns days_diff and A. It is applied to the grouped by column CID sub-dataframes of df.
  • First step: Initializing the result dataframe result (column days_diff filled with NaT, column A with None), and setting the starting value start for the day differences to the first day in the group.
  • Afterwards essentially looping over the sub-dataframe after the first index, thereby grabbing the index, the value in column Date, and a boolean value next_MID_is_na that signifies if the value of the MID column in the next row ist NaN (via .shift(1).isna()).
  • In every step of the loop:
    1. Calculation of the difference of the current day to the start day.
    2. Checking the rules for the days_diff column:
      • If difference of current and start day <= 30 days and NaN in next MID-row -> day-difference.
      • Otherwise -> reset of start to the current day.
  • After finishing column days_diff calculation of column A: result.days_diff.isna() is True (== 1) when days_diff is NaN, False (== 0) otherwise. Therefore the cummulative sum (.cumsum()) gives the required result.
  • After the groupby-apply to produce the columns days_diff and A finally the calculation of column B: Selection of RefID-values where the values A change (via .where(df.A != df.A.shift(1))), and then forward filling the remaining NaNs.
https://en.xdnf.cn/q/118491.html

Related Q&A

EC2 .bashrc and .bash_profile re-setting

Reason Im asking: pycurl requires both libcurl-devel and openssl-devel. To install these, I have these two lines the my .bash_profile: sudo yum install libcurl-devel sudo yum install -y openssl-develPr…

How to load mutiple PPM files present in a folder as single Numpy ndarray?

The following Python code creates list of numpy array. I want to load by data sets as a numpy array that has dimension K x M x N x 3 , where K is the index of the image and M x N x 3 is the dimension …

Python: Understanding Threading Module

While learning Pythons threading module Ive run a simple test. Interesting that the threads are running sequentially and not parallel. Is it possible to modify this test code so a program executes the …

How the program has control with the break statement [closed]

Its difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying thi…

Python: how to plot data coming from different excel sheets in the same chart

I need to create an interactive chart on python taking data from different sheets of an Excel file. I tried to create a for loop to take data from all the sheets automatically, but I manage to graph on…

Dynamic html table with django [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.Want to improve this question? Update the question so it focuses on one problem only by editing this post.Closed 6…

Driving costs - functions in python- EOFerror in if __name__ == __main__:

I am stuck with this question: Write a function driving_cost() with input parameters miles_per_gallon, dollars_per_gallon, and miles_driven, that returns the dollar cost to drive those miles. All items…

Pandas loc dynamic conditional list

I have a Pandas DataFrame and I want to find all rows where the ith column values are 10 times greater than other columns. Here is an example of my DataFrame:For example, looking at column i=0, row B (…

no module named numpy python2.7

Im using python 2.7 on Linux CentOS 6.5. After successfully using yum to install numpy, I am unable to import the module.from numpy import *The above code produces the following error:no module named …

TypeError: list indices must be integers or slices, not tuple for list of tuples

I am getting "list indices must be integers or slices, not tuple" error while trying to generate list from list of tuples. list of tuples have the following structure:[(29208, 8, 8, 8), (2920…