I have a Pandas dataframe containing records for a lot of orders, one recorde for each order. Each record has order_id
, category_id
, created_at
and picked_at
. I need to calculate queue length for each order at the time of it's creation. Which means for each record current_order
I need to count the number of rows with following conditions:
- must have the same
category_id
as the current_order
- must be created before
created_at
of the current_order
- must be picked after
created_at
of the current_order
The dataframe is quite larg hence doing the calculation using a loop is too time consuming.
How can I do this faster?
Any help would be greatly appreciated.
Edited
A sample of dataframe:
id category_id created_at picked_at
0 123228779 69558 2021-05-22 00:08:46 2021-05-22 00:22:45
1 123228972 69558 2021-05-22 00:12:39 2021-05-22 00:17:00
2 123229120 6725 2021-05-22 00:15:47 2021-05-22 00:42:50
3 123229210 41358 2021-05-22 00:17:44 2021-05-22 00:35:34
4 123229152 6725 2021-05-22 00:16:29 2021-05-22 01:05:43
Let's first start by reshaping the dataframe to have created_at
and picked_at
in the same column. Then we calculate the queue value.
df2 = (df.melt(id_vars=['id', 'category_id'],var_name='type',value_name='time').sort_values(by=['category_id', 'time']) # not required to sort by "category_id",# but done here for clarity)df2['queue'] = (df2['type'].map({'created_at': 1, 'picked_at': -1}).cumsum())
>>> df2id category_id type time queue
2 123229120 6725 created_at 2021-05-22 00:15:47 1
4 123229152 6725 created_at 2021-05-22 00:16:29 2
7 123229120 6725 picked_at 2021-05-22 00:42:50 1
9 123229152 6725 picked_at 2021-05-22 01:05:43 0
3 123229210 41358 created_at 2021-05-22 00:17:44 1
8 123229210 41358 picked_at 2021-05-22 00:35:34 0
0 123228779 69558 created_at 2021-05-22 00:08:46 1
1 123228972 69558 created_at 2021-05-22 00:12:39 2
6 123228972 69558 picked_at 2021-05-22 00:17:00 1
5 123228779 69558 picked_at 2021-05-22 00:22:45 0
Finally, we reshape the queue to the original dataframe:
df['queue'] = (df2.pivot(columns=['type'],values=['queue']).loc[:, ('queue', 'created_at')].dropna().astype(int))
output:
id category_id created_at picked_at queue
0 123228779 69558 2021-05-22 00:08:46 2021-05-22 00:22:45 1
1 123228972 69558 2021-05-22 00:12:39 2021-05-22 00:17:00 2
2 123229120 6725 2021-05-22 00:15:47 2021-05-22 00:42:50 1
3 123229210 41358 2021-05-22 00:17:44 2021-05-22 00:35:34 1
4 123229152 6725 2021-05-22 00:16:29 2021-05-22 01:05:43 2
NB. this gives us the queue, per category_id
, after creation.