I have a pandas dataframe like this :
df = pd.DataFrame([['A', 1234, 20120201],['A', 1134, 20120201],['A', 1011, 20120201],['A', 1123, 20121004],['A', 1111, 20121004],['A', 1224, 20121105],['B', 1156, 20120403],['B', 2345, 20120504],['B', 4567, 20120504],['B', 8796, 20120606]], columns = ['company', 'invoice', 'date'])
The aim is to create a new column called 'TotalPaidInvoices' which counts number of invoices paid prior to each record.
I tried the following
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['company', 'date'], ascending=[True, True]).reset_index(drop=True)
df['totalpaidinvoices']= df[(df['date'] != df['date'].shift(1))].groupby(['company']).cumcount()
df['totalpaidinvoices']= df.groupby('company')['totalpaidinvoices'].fillna(method='ffill')
But instead of number of invoices what I get is number of company - date combinations prior to the current record.
Output :
df = pd.DataFrame([['A', 1234, 20120201, 0.0],['A', 1134, 20120201, 0.0],['A', 1011, 20120201, 0.0],['A', 1123, 20121004, 1.0],['A', 1111, 20121004, 1.0],['A', 1224, 20121105, 2.0],['B', 1156, 20120403, 0.0],['B', 2345, 20120504, 1.0],['B', 4567, 20120504, 1.0],['B', 8796, 20120606, 2.0]], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])
Expected output :
df = pd.DataFrame([['A', 1234, 20120201, 0.0],['A', 1134, 20120201, 0.0],['A', 1011, 20120201, 0.0],['A', 1123, 20121004, 3.0],['A', 1111, 20121004, 3.0],['A', 1224, 20121105, 5.0],['B', 1156, 20120403, 0.0],['B', 2345, 20120504, 1.0],['B', 4567, 20120504, 1.0],['B', 8796, 20120606, 3.0]], columns = ['company', 'invoice', 'date', 'totalpaidinvoices'])
Any suggestions to fix?