Cumulative count at a group level in Python (pandas)

2024/10/8 22:19:07

I have a pandas DataFrame like this:

import pandas as pd

df = pd.DataFrame(
    [['A', 1234, 20120201], ['A', 1134, 20120201], ['A', 1011, 20120201],
     ['A', 1123, 20121004], ['A', 1111, 20121004], ['A', 1224, 20121105],
     ['B', 1156, 20120403], ['B', 2345, 20120504], ['B', 4567, 20120504],
     ['B', 8796, 20120606]],
    columns=['company', 'invoice', 'date'],
)

The aim is to create a new column, 'totalpaidinvoices', that holds, for each record, the number of invoices the same company paid on earlier dates.

I tried the following:

df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')  # parse YYYYMMDD integers as calendar dates
df = df.sort_values(['company', 'date'], ascending=[True, True]).reset_index(drop=True)
df['totalpaidinvoices'] = df[df['date'] != df['date'].shift(1)].groupby('company').cumcount()
df['totalpaidinvoices'] = df.groupby('company')['totalpaidinvoices'].ffill()

But instead of the number of invoices, I get the number of distinct earlier dates for the company, that is, the number of company-date combinations before the current record.

Output:

df = pd.DataFrame([['A', 1234, 20120201, 0.0],['A', 1134, 20120201, 0.0],['A', 1011, 20120201, 0.0],['A', 1123, 20121004, 1.0],['A', 1111, 20121004, 1.0],['A', 1224, 20121105, 2.0],['B', 1156, 20120403, 0.0],['B', 2345, 20120504, 1.0],['B', 4567, 20120504, 1.0],['B', 8796, 20120606, 2.0]], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])

Expected output:

df = pd.DataFrame([['A', 1234, 20120201, 0.0],['A', 1134, 20120201, 0.0],['A', 1011, 20120201, 0.0],['A', 1123, 20121004, 3.0],['A', 1111, 20121004, 3.0],['A', 1224, 20121105, 5.0],['B', 1156, 20120403, 0.0],['B', 2345, 20120504, 1.0],['B', 4567, 20120504, 1.0],['B', 8796, 20120606, 3.0]], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])

Any suggestions on how to fix this?

Answer

First, let's count the number of invoices paid on each day for each company:

tmp1 = df.groupby(['company', 'date']).size().rename('totalpaidinvoices')

Then, for each company, we need to count how many invoices were paid before the current date. That's a job for cumsum: take the running total within each company and subtract the current date's own count.

tmp2 = tmp1.groupby(level='company').cumsum() - tmp1
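To see why subtracting the per-date count from the running total gives the "paid before" figure, here is a small sketch using the per-date counts for company A from the sample data (3, 2 and 1 invoices). The variable names are illustrative only and are not part of the answer.

```python
import pandas as pd

# Per-date invoice counts for company 'A', taken from the question's sample data.
sizes = pd.Series([3, 2, 1], index=[20120201, 20121004, 20121105])

running_total = sizes.cumsum()        # invoices paid up to and including each date
paid_before = running_total - sizes   # invoices paid strictly before each date

print(paid_before.tolist())  # [0, 3, 5]
```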

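And finally, merge the calculation with the original dataframe. If df still carries a 'totalpaidinvoices' column from an earlier attempt, drop it first; otherwise the merge produces suffixed duplicate columns.

result = df.merge(tmp2, left_on=['company', 'date'], right_index=True)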
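Put together, the three steps can be sketched as one self-contained script on the question's sample data. It computes the running total with a grouped cumsum minus the per-date count, which is one way to express the "cumsum minus self" idea, and it leaves the dates as YYYYMMDD integers on the assumption that they sort chronologically in that form.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 1234, 20120201], ['A', 1134, 20120201], ['A', 1011, 20120201],
     ['A', 1123, 20121004], ['A', 1111, 20121004], ['A', 1224, 20121105],
     ['B', 1156, 20120403], ['B', 2345, 20120504], ['B', 4567, 20120504],
     ['B', 8796, 20120606]],
    columns=['company', 'invoice', 'date'],
)

# Step 1: invoices per (company, date); groupby sorts the keys, so dates ascend within each company.
tmp1 = df.groupby(['company', 'date']).size().rename('totalpaidinvoices')

# Step 2: running total within each company, minus the current date's own count.
tmp2 = tmp1.groupby(level='company').cumsum() - tmp1

# Step 3: attach the per-(company, date) figure to every invoice row.
result = df.merge(tmp2, left_on=['company', 'date'], right_index=True)

print(result)
```

On the sample data, the new column matches the expected output from the question: 0, 0, 0, 3, 3, 5 for company A and 0, 1, 1, 3 for company B.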

If you prefer method chaining:

result = (
    df.groupby(['company', 'date']).size()
      .groupby(level='company')
      .transform(lambda s: s.cumsum() - s)
      .to_frame('totalpaidinvoices')
      .merge(df, how='right', left_index=True, right_on=['company', 'date'])
)
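As an alternative that stays closer to the cumcount attempt in the question, the following sketch avoids the merge entirely. It is not the approach from the answer above. It assumes the frame is sorted by company and date: each row's position within its company is then the number of rows before it, and the smallest position within a (company, date) group is the number of invoices paid on earlier dates.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 1234, 20120201], ['A', 1134, 20120201], ['A', 1011, 20120201],
     ['A', 1123, 20121004], ['A', 1111, 20121004], ['A', 1224, 20121105],
     ['B', 1156, 20120403], ['B', 2345, 20120504], ['B', 4567, 20120504],
     ['B', 8796, 20120606]],
    columns=['company', 'invoice', 'date'],
)

# Parse the YYYYMMDD integers as dates and sort; the approach relies on this ordering.
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
df = df.sort_values(['company', 'date']).reset_index(drop=True)

# Position of each row within its company (0, 1, 2, ...).
row_number = df.groupby('company').cumcount()

# The first position within each (company, date) group equals the count of earlier-dated invoices.
df['totalpaidinvoices'] = row_number.groupby([df['company'], df['date']]).transform('min')

print(df)
```

This version yields integer counts directly and keeps the original column order, at the cost of depending on the sort.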