I have a pandas dataframe shown below:
CID RefID Date Group MID
100 1 1/01/2021 A
100 2 3/01/2021 A
100 3 4/01/2021 A 101
100 4 15/01/2021 A
100 5 18/01/2021 A
200 6 3/03/2021 B
200 7 4/04/2021 B
200 8 9/04/2021 B 102
200 9 25/04/2021 B
300 10 26/04/2021 C
300 11 27/05/2021 C
300 12 28/05/2021 C 103
I want to create three columns:
days_diff:
This has to be created in a way that if the difference b/w the first Date and corresponding rows is greater than 30 belonging to the same CID then assign 'NAT' or 0 to the next row (reset) and then subtract the date with this row for the following values
If MIDis not null and belong to same CID group assign 'NAT' or 0 to the next row (reset) and then subtract the date with this row for the following values
Otherwise just fetch the date difference b/w the first row belonging to the same CID for the corresponding rows
A: This depends on the days_diff column , this column is like a counter it will only change/increment when there's another NAT occurrence for the same CID and reset itself for every CID.
B: This column depends on the column A , if the value in A remains same it won't change otherwise increments
It's a bit complicated to explain please refer to the output below for reference. I have used .groupby()
.diff()
and .shift()
methods to create multiple dummy columns in order to calculate this and still working on it, please let me know the best way to go about this, thanks
My expected output :
CID RefID Date Group MID days_diff A B
100 1 1/01/2021 A NAT 1 1
100 2 3/01/2021 A 2 days 1 1
100 3 4/01/2021 A 101 3 days 1 1
100 4 15/01/2021 A NAT 2 4
100 5 18/01/2021 A 3 days 2 4
200 6 3/03/2021 B NAT 1 6
200 7 4/04/2021 B NAT 2 7
200 8 9/04/2021 B 102 5 days 2 7
200 9 25/04/2021 B NAT 3 9
300 10 26/04/2021 C NAT 1 10
300 11 27/05/2021 C NAT 2 11
300 12 28/05/2021 C 103 1 day 2 11