I'm trying to reshape my dataframe from a long format, in which one column holds a categorical variable, into a wide format in which each category has its own price column. Currently, my data looks like this:
date-time date vendor payment_type price
03-10-15 10:00:00 03-10-15 A1 1 50
03-10-15 10:00:00 03-10-15 A1 2 60
03-10-15 10:00:00 03-11-15 A1 1 45
03-10-15 10:00:00 03-11-15 A1 2 70
03-10-15 10:00:00 03-12-15 B1 1 40
03-10-15 10:00:00 03-12-15 B1 2 45
03-10-15 10:00:00 03-10-15 C1 1 60
03-10-15 10:00:00 03-10-15 C1 1 65
My goal is to have a column for every vendor's price and for each payment type and one row per day. When there are multiple values per day, I want to use the maximum value. The end result should look something like this.
Date A1_Pay1 A1_Pay2 ... C1_Pay1 C1_Pay2
03-10-15 50 60 ... 65 NaN
03-11-15 45 70 ... NaN NaN
03-12-15 NaN NaN ... NaN NaN
I tried using unstack and pivot, but I either wasn't getting what I was going for, or was getting an error about Date not being a unique index.
Any ideas?
You can use pivot_table:
import pandas as pd

# convert column payment_type to string so it can be joined into column names
df['payment_type'] = df['payment_type'].astype(str)
df = pd.pivot_table(df, index='date', columns=['vendor', 'payment_type'], aggfunc=max)
# remove top level of multiindex
df.columns = df.columns.droplevel(0)
# reset multicolumns: flatten (vendor, payment_type) into a single name
df.columns = ['_Pay'.join(col).strip() for col in df.columns.values]
print(df)

            A1_Pay1  A1_Pay2  B1_Pay1  B1_Pay2  C1_Pay1
date
2015-03-10       50       60      NaN      NaN       65
2015-03-11       45       70      NaN      NaN      NaN
2015-03-12      NaN      NaN       40       45      NaN
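For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch using the sample rows from the question. It makes a few assumptions that are not in the original post: the dates are month-day-year strings, the date-time column is left out, and `values="price"` is passed explicitly so that only the price column is aggregated (in which case there is no extra top column level to drop). The name `wide` is just an illustrative variable.

```python
import pandas as pd

# Sample rows copied from the question (date-time column omitted for brevity).
df = pd.DataFrame({
    "date": ["03-10-15", "03-10-15", "03-11-15", "03-11-15",
             "03-12-15", "03-12-15", "03-10-15", "03-10-15"],
    "vendor": ["A1", "A1", "A1", "A1", "B1", "B1", "C1", "C1"],
    "payment_type": [1, 2, 1, 2, 1, 2, 1, 1],
    "price": [50, 60, 45, 70, 40, 45, 60, 65],
})

# Assumption: the date strings are month-day-year.
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%y")

# One row per date, one column per (vendor, payment_type) pair;
# duplicate entries for the same cell are reduced with max.
wide = pd.pivot_table(
    df,
    index="date",
    columns=["vendor", "payment_type"],
    values="price",
    aggfunc="max",
)

# Flatten the (vendor, payment_type) column tuples into names like "A1_Pay1".
wide.columns = [f"{vendor}_Pay{ptype}" for vendor, ptype in wide.columns]
print(wide)
```

Combinations that never occur in the data (C1 with payment type 2 here) do not get a column; if you need them, you would have to add them yourself, for example with `reindex` on the columns.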
EDIT:
If you need other statistics, you can pass them as a list to aggfunc:
import numpy as np

# convert column payment_type to string
df['payment_type'] = df['payment_type'].astype(str)
df = pd.pivot_table(df, index='date', columns=['vendor', 'payment_type'], aggfunc=[np.mean, np.max, np.median])
print(df)

              mean                          amax                        median
             price                         price                         price
vendor          A1          B1          C1    A1          B1          C1    A1          B1          C1
payment_type     1     2     1     2     1     1     2     1     2     1     1     2     1     2     1
date
2015-03-10      50    60   NaN   NaN  62.5    50    60   NaN   NaN    65    50    60   NaN   NaN  62.5
2015-03-11      45    70   NaN   NaN   NaN    45    70   NaN   NaN   NaN    45    70   NaN   NaN   NaN
2015-03-12     NaN   NaN    40    45   NaN   NaN   NaN    40    45   NaN   NaN   NaN    40    45   NaN
#remove top level of multiindex
df.columns = df.columns.droplevel(1)
#reset multicolumns
df.columns = ['_Pay'.join(col).strip() for col in df.columns.values]
print(df)

            mean_PayA1_Pay1  mean_PayA1_Pay2  mean_PayB1_Pay1  \
date
2015-03-10               50               60              NaN
2015-03-11               45               70              NaN
2015-03-12              NaN              NaN               40

            mean_PayB1_Pay2  mean_PayC1_Pay1  amax_PayA1_Pay1  \
date
2015-03-10              NaN             62.5               50
2015-03-11              NaN              NaN               45
2015-03-12               45              NaN              NaN

            amax_PayA1_Pay2  amax_PayB1_Pay1  amax_PayB1_Pay2  \
date
2015-03-10               60              NaN              NaN
2015-03-11               70              NaN              NaN
2015-03-12              NaN               40               45

            amax_PayC1_Pay1  median_PayA1_Pay1  median_PayA1_Pay2  \
date
2015-03-10               65                 50                 60
2015-03-11              NaN                 45                 70
2015-03-12              NaN                NaN                NaN

            median_PayB1_Pay1  median_PayB1_Pay2  median_PayC1_Pay1
date
2015-03-10                NaN                NaN               62.5
2015-03-11                NaN                NaN                NaN
2015-03-12                 40                 45                NaN
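Joining every level with '_Pay' gives slightly awkward names such as mean_PayA1_Pay1. If you would rather have names like A1_Pay1_mean, one option is to build each name from the individual levels instead. The following is only a sketch under my own assumptions: it uses the sample rows from the question with the dates kept as plain strings, passes `values="price"` and string aggregation names, and the variable `stats` and the naming pattern are illustrative choices rather than anything required by pandas.

```python
import pandas as pd

# Sample rows copied from the question (date-time column omitted for brevity).
df = pd.DataFrame({
    "date": ["03-10-15", "03-10-15", "03-11-15", "03-11-15",
             "03-12-15", "03-12-15", "03-10-15", "03-10-15"],
    "vendor": ["A1", "A1", "A1", "A1", "B1", "B1", "C1", "C1"],
    "payment_type": [1, 2, 1, 2, 1, 2, 1, 1],
    "price": [50, 60, 45, 70, 40, 45, 60, 65],
})

# Compute several statistics per (date, vendor, payment_type) cell.
stats = pd.pivot_table(
    df,
    index="date",
    columns=["vendor", "payment_type"],
    values="price",
    aggfunc=["mean", "max", "median"],
)

# Each column label is a tuple that starts with the statistic name and ends
# with (vendor, payment_type). Indexing from both ends keeps this working
# even if an extra level such as "price" sits in between.
stats.columns = [f"{col[-2]}_Pay{col[-1]}_{col[0]}" for col in stats.columns]
print(stats.columns.tolist())
```

Putting the statistic last keeps all columns for one vendor and payment type next to each other when the column names are sorted alphabetically.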