Question 1

I have a pandas dataframe like this:

import pandas as pd

df = pd.DataFrame(
    [['A', 1234, 20120201], ['A', 1134, 20120201], ['A', 1011, 20120201],
     ['A', 1123, 20121004], ['A', 1111, 20121004], ['A', 1224, 20121105],
     ['B', 1156, 20120403], ['B', 2345, 20120504], ['B', 4567, 20120504],
     ['B', 8796, 20120606]],
    columns=['company', 'invoice', 'date'])

The aim is to create a new column called 'totalpaidinvoices' that counts, for each record, the number of invoices the same company paid before that record's date.

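I tried the following:

# parse the integer dates as YYYYMMDD (without a format they are read as nanoseconds since the epoch)
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
df = df.sort_values(['company', 'date']).reset_index(drop=True)
# number only the first row of each new date within a company
df['totalpaidinvoices'] = df[df['date'] != df['date'].shift(1)].groupby('company').cumcount()
# carry that number forward to the remaining rows of the same date
df['totalpaidinvoices'] = df.groupby('company')['totalpaidinvoices'].ffill()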

But instead of the number of invoices, what I get is the number of distinct dates per company before the current record.

Output:

df = pd.DataFrame([['A', 1234, 20120201, 0.0],['A', 1134, 20120201, 0.0],['A', 1011, 20120201, 0.0],['A', 1123, 20121004, 1.0],['A', 1111, 20121004, 1.0],['A', 1224, 20121105, 2.0],['B', 1156, 20120403, 0.0],['B', 2345, 20120504, 1.0],['B', 4567, 20120504, 1.0],['B', 8796, 20120606, 2.0]], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])

Expected output:

df = pd.DataFrame([['A', 1234, 20120201, 0.0],['A', 1134, 20120201, 0.0],['A', 1011, 20120201, 0.0],['A', 1123, 20121004, 3.0],['A', 1111, 20121004, 3.0],['A', 1224, 20121105, 5.0],['B', 1156, 20120403, 0.0],['B', 2345, 20120504, 1.0],['B', 4567, 20120504, 1.0],['B', 8796, 20120606, 3.0]], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])

Any suggestions on how to fix this?

Answer

First, let's count the number of invoices paid on each day for each company:

tmp1 = df.groupby(['company', 'date']).size().rename('totalpaidinvoices')

Then for each company, we need to count how many invoices were paid prior to the current period. That's a job for cumsum:

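# running total per company, minus the current day's own count
# (a plain grouped cumsum keeps the original (company, date) index, which the merge below relies on)
tmp2 = tmp1.groupby('company').cumsum() - tmp1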
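To make the cumulative-sum step above concrete, here is a minimal sketch on the per-day counts of the sample data. The counts are typed in by hand for illustration (the names `counts`, `inclusive` and `prior` are not from the original code). Subtracting each day's own count from the running total leaves only the invoices paid on earlier days.

```python
import pandas as pd

# Per-day invoice counts for the sample data, typed in by hand for illustration.
counts = pd.Series(
    [3, 2, 1, 1, 2, 1],
    index=pd.MultiIndex.from_tuples(
        [('A', 20120201), ('A', 20121004), ('A', 20121105),
         ('B', 20120403), ('B', 20120504), ('B', 20120606)],
        names=['company', 'date'],
    ),
    name='totalpaidinvoices',
)

# Running total per company, including the current day.
inclusive = counts.groupby(level='company').cumsum()

# Subtract the current day's own count so only earlier days remain.
prior = inclusive - counts
print(prior)
```

For company A the running total is 3, 5, 6, so the number of invoices paid before each date is 0, 3, 5.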

And finally, merge the calculation with the original dataframe:

df.merge(tmp2, left_on=['company', 'date'], right_index=True)

If you prefer method chaining:

result = (
    df.groupby(['company', 'date']).size()
      .pipe(lambda s: s.groupby('company').cumsum() - s)
      .to_frame('totalpaidinvoices')
      .merge(df, how='right', left_index=True, right_on=['company', 'date'])
)
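As a quick sanity check, the sketch below rebuilds the sample frame from the question and compares the computed column with the expected output. It uses the grouped `cumsum` directly rather than `apply`, and a left merge so every original row is kept in order. The names `per_day`, `prior` and `expected` are illustrative, and `expected` is simply the expected column typed in by hand.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 1234, 20120201], ['A', 1134, 20120201], ['A', 1011, 20120201],
     ['A', 1123, 20121004], ['A', 1111, 20121004], ['A', 1224, 20121105],
     ['B', 1156, 20120403], ['B', 2345, 20120504], ['B', 4567, 20120504],
     ['B', 8796, 20120606]],
    columns=['company', 'invoice', 'date'],
)
# Parse the integer dates as YYYYMMDD strings.
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')

# Invoices paid per company per day.
per_day = df.groupby(['company', 'date']).size()

# Invoices paid on earlier days: running total minus the current day's count.
prior = (per_day.groupby(level='company').cumsum() - per_day).rename('totalpaidinvoices')

# Attach the per-day value to every invoice row.
result = df.merge(prior, how='left', left_on=['company', 'date'], right_index=True)

# Expected column from the question, typed in by hand.
expected = [0, 0, 0, 3, 3, 5, 0, 1, 1, 3]
assert result['totalpaidinvoices'].tolist() == expected
```

Note that the result holds integers, whereas the expected output in the question shows floats only because the original attempt went through NaN values.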
