I have a dataframe that looks something like this (actual dataframe is millions of rows):
ID | Category | Site | Task Completed | Access Completed |
---|---|---|---|---|
1 | A | X | 1/2/22 12:00:00AM | 1/1/22 12:00:00 AM |
1 | A | Y | 1/3/22 12:00:00AM | 1/2/22 12:00:00 AM |
1 | A | X | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
1 | B | X | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
2 | A | X | 1/2/22 12:00:00AM | 1/1/22 12:00:00 AM |
2 | A | X | 1/1/22 1:00:00AM | 1/1/22 12:00:00 AM |
A few things to point out about this dataframe. First, the 'access completed' date is constant for a ID/category/site combination. So, for ID 1, Cat A, Site X, the access completed will always be 1/1/22 12:00:00 AM no matter how many 'task completed' there are for that combo. Second, there can be multiple task completed dates for a ID/Category/Site combination.
What I want to find is the time difference (in hours) between Access Completed and first Task Completed for every ID/Category/Site combination within the dataset. The intended result would look something like this:
ID | Category | Site | Time Difference |
---|---|---|---|
1 | A | X | 1 |
1 | A | Y | 24 |
1 | B | X | 1 |
2 | A | X | 1 |
Thanks so much for your help.