Question 1

I have a Pandas dataframe containing records for a lot of orders, one recorde for each order. Each record has order_id, category_id, created_at and picked_at. I need to calculate queue length for each order at the time of it's creation. Which means for each record current_order I need to count the number of rows with following conditions:

must have the same category_id as the current_order
must be created before created_at of the current_order
must be picked after created_at of the current_order

The dataframe is quite larg hence doing the calculation using a loop is too time consuming. How can I do this faster?

Any help would be greatly appreciated.

Edited

A sample of dataframe:

          id  category_id          created_at           picked_at
0  123228779        69558 2021-05-22 00:08:46 2021-05-22 00:22:45
1  123228972        69558 2021-05-22 00:12:39 2021-05-22 00:17:00
2  123229120         6725 2021-05-22 00:15:47 2021-05-22 00:42:50
3  123229210        41358 2021-05-22 00:17:44 2021-05-22 00:35:34
4  123229152         6725 2021-05-22 00:16:29 2021-05-22 01:05:43

Question 2

Let's first start by reshaping the dataframe to have created_at and picked_at in the same column. Then we calculate the queue value.

df2 = (df.melt(id_vars=['id', 'category_id'],var_name='type',value_name='time').sort_values(by=['category_id', 'time']) # not required to sort by "category_id",# but done here for clarity)df2['queue'] = (df2['type'].map({'created_at': 1, 'picked_at': -1}).cumsum())

>>> df2id  category_id        type                time  queue
2  123229120         6725  created_at 2021-05-22 00:15:47      1
4  123229152         6725  created_at 2021-05-22 00:16:29      2
7  123229120         6725   picked_at 2021-05-22 00:42:50      1
9  123229152         6725   picked_at 2021-05-22 01:05:43      0
3  123229210        41358  created_at 2021-05-22 00:17:44      1
8  123229210        41358   picked_at 2021-05-22 00:35:34      0
0  123228779        69558  created_at 2021-05-22 00:08:46      1
1  123228972        69558  created_at 2021-05-22 00:12:39      2
6  123228972        69558   picked_at 2021-05-22 00:17:00      1
5  123228779        69558   picked_at 2021-05-22 00:22:45      0

Finally, we reshape the queue to the original dataframe:

df['queue'] = (df2.pivot(columns=['type'],values=['queue']).loc[:, ('queue', 'created_at')].dropna().astype(int))

output:

          id  category_id          created_at           picked_at  queue
0  123228779        69558 2021-05-22 00:08:46 2021-05-22 00:22:45      1
1  123228972        69558 2021-05-22 00:12:39 2021-05-22 00:17:00      2
2  123229120         6725 2021-05-22 00:15:47 2021-05-22 00:42:50      1
3  123229210        41358 2021-05-22 00:17:44 2021-05-22 00:35:34      1
4  123229152         6725 2021-05-22 00:16:29 2021-05-22 01:05:43      2

NB. this gives us the queue, per category_id, after creation.

calculate the queue for orders based on creation and delivery date, by product group

Related Q&A

Python print with string invalid syntax

How to load images and text labels for CNN regression from different folders

How to calculate number of dates within a year of a date in pandas

Remove substring from string if substring in list in data frame column

how to save data in the db django model?

Move existing jointplot legend

timezone conversion of a large list of timestamps from an excel file with python

N_gram frequency python NTLK

Is there a way to have a list of 4 billion numbers in Python?

ValueError: invalid literal for int() with base 10: when it worked before