Question 1

I have a dataframe that looks something like this (actual dataframe is millions of rows):

ID	Category	Site	Task Completed	Access Completed
1	A	X	1/2/22 12:00:00AM	1/1/22 12:00:00 AM
1	A	Y	1/3/22 12:00:00AM	1/2/22 12:00:00 AM
1	A	X	1/1/22 1:00:00AM	1/1/22 12:00:00 AM
1	B	X	1/1/22 1:00:00AM	1/1/22 12:00:00 AM
2	A	X	1/2/22 12:00:00AM	1/1/22 12:00:00 AM
2	A	X	1/1/22 1:00:00AM	1/1/22 12:00:00 AM

A few things to point out about this dataframe. First, the 'access completed' date is constant for a ID/category/site combination. So, for ID 1, Cat A, Site X, the access completed will always be 1/1/22 12:00:00 AM no matter how many 'task completed' there are for that combo. Second, there can be multiple task completed dates for a ID/Category/Site combination.

What I want to find is the time difference (in hours) between Access Completed and first Task Completed for every ID/Category/Site combination within the dataset. The intended result would look something like this:

ID	Category	Site	Time Difference
1	A	X	1
1	A	Y	24
1	B	X	1
2	A	X	1

Thanks so much for your help.

Question 2

Try as follows:

First, use pd.to_datetime to turn date/time strings into datetime.
Sort the df on column "Task Completed", apply df.groupby and get .first.
Now, subtract "Access Completed" from "Task Completed" using Series.sub, and use .dt.total_seconds on the Timedelta values to get all seconds. Since we want hours, we apply .div(3600).
Lastly, reset the index and use df.rename to assign a meaningful column name.

df[['Task Completed','Access Completed']] = \df[['Task Completed','Access Completed']].apply(lambda x: pd.to_datetime(x))res = df.sort_values('Task Completed').groupby(['ID','Category','Site']).first()
res = res['Task Completed'].sub(res['Access Completed'])\.dt.total_seconds().div(3600).reset_index(drop=False).rename(columns={0:'Time Difference'})print(res)ID Category Site  Time Difference
0   1        A    X              1.0
1   1        A    Y             24.0
2   1        B    X              1.0
3   2        A    X              1.0

Calculate Time Difference based on Conditionals

Related Q&A

Cannot open jpg images with PIL or open()

how to delete tensorflow model before retraining

Use beautifulsoup to scrape a table within a webpage?

Encrypt folder or zip file using python

Use Python Element Tree to parse xml in ASCII text file [closed]

Time series plot showing unique occurrences per day

Problem accessing indexed results two stage stochastic programming Pyomo

Pandas python + format for values

Is there any implementation of deconvolution?

discord.py How to check if user is on server?