I'm using the pandas pivot_table
function on a large dataset (10 million rows, 6 columns). As execution time is paramount, I'm trying to speed up the process. Currently it takes around 8 seconds to process the whole dataset, which is way too slow, and I hope to find alternatives that improve speed/performance.
My current pandas pivot_table call:
df_pivot = df_original.pivot_table(
    index="industry",
    columns="months",
    values=["orders", "client_name"],
    aggfunc={"orders": np.sum, "client_name": pd.Series.nunique},
)
df_original
contains all the data (10M rows, imported from a CSV). industry is the client's industry, months is the order month (Jan to Dec), and orders is the number of orders. All columns were converted to categorical
dtype, except the number of orders (int
dtype). Originally industry, months, and client_name were strings.
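For context, the categorical conversion mentioned above can be done like this. This is only a sketch on a hypothetical tiny frame standing in for df_original; the real data would come from the CSV import:

```python
import pandas as pd

# Hypothetical stand-in for df_original (the real one has 10M rows).
df = pd.DataFrame({
    "industry": ["Retail", "Tech", "Retail", "Tech"],
    "months": ["Jan", "Feb", "Jan", "Feb"],
    "client_name": ["A", "B", "A", "C"],
    "orders": [3, 5, 2, 7],
})

# Repetitive string columns become categorical: less memory,
# faster groupby/pivot operations on large data.
for col in ["industry", "months", "client_name"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```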
I tried pandas.DataFrame.unstack
, which was even slower. I also experimented with Dask
. The Dask pivot_table
yielded some improvement (6 s execution time, so 2 s less), but it is still pretty slow.
Are there any faster alternatives for large datasets? Maybe recreating the pivot table with groupby
, crosstab
, or something similar? Unfortunately, I could not get these alternatives to work at all, and I am still quite new to Python and pandas.
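For reference, here is what a crosstab version might look like, sketched on a hypothetical tiny frame. Note that pd.crosstab takes only one values/aggfunc pair per call, so the two aggregations need two calls:

```python
import pandas as pd

# Hypothetical stand-in for df_original.
df = pd.DataFrame({
    "industry": ["Retail", "Retail", "Tech"],
    "months": ["Jan", "Jan", "Feb"],
    "client_name": ["A", "B", "A"],
    "orders": [3, 2, 5],
})

# One crosstab per aggregation: summed orders and unique client counts,
# both with industry as rows and months as columns.
orders = pd.crosstab(df["industry"], df["months"],
                     values=df["orders"], aggfunc="sum")
clients = pd.crosstab(df["industry"], df["months"],
                      values=df["client_name"], aggfunc="nunique")
```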
Looking forward to your suggestions. Thanks in advance!
Update:
I figured out the groupby approach:
df_new = (
    df_original
    .groupby(["months", "industry"])
    .agg({"orders": np.sum, "client_name": pd.Series.nunique})
    .unstack(level="months")
    .fillna(0)
)
This is much faster now, at about 2-3 seconds. Are there still options to improve speed further?
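One thing I might still try (a sketch on a hypothetical tiny frame, not yet benchmarked on the real data): string aggregation names like "sum" and "nunique" let pandas use its fast built-in paths instead of the Python-level pd.Series.nunique, and observed=True skips empty category combinations when grouping on categorical columns:

```python
import pandas as pd

# Hypothetical stand-in for df_original.
df_original = pd.DataFrame({
    "industry": ["Retail", "Tech", "Retail"],
    "months": ["Jan", "Jan", "Feb"],
    "client_name": ["A", "B", "A"],
    "orders": [3, 5, 2],
})

df_new = (
    df_original
    # observed=True only emits category combinations actually present.
    .groupby(["months", "industry"], observed=True)
    # Named aggregation with built-in string funcs ("sum", "nunique").
    .agg(orders=("orders", "sum"), clients=("client_name", "nunique"))
    # fill_value=0 replaces the separate .fillna(0) step.
    .unstack(level="months", fill_value=0)
)
```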