I'm using the pandas pivot_table
function on a large dataset (10 million rows, 6 columns). As execution time is paramount, I'm trying to speed up the process. Currently it takes around 8 seconds to process the whole dataset, which is way too slow, and I hope to find alternatives that improve speed/performance.
My current pandas pivot_table call:
df_pivot = df_original.pivot_table(
    index="industry",
    columns="months",
    values=["orders", "client_name"],
    aggfunc={"orders": np.sum, "client_name": pd.Series.nunique},
)
df_original
contains all the data (10M rows, imported from a CSV). industry is the client's industry, months is the order month (Jan to Dec), and orders is the number of orders. All columns were converted to categorical
dtype, except the number of orders (int
dtype). Originally industry, months, and client_name were strings.
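For context, the categorical conversion mentioned above can be done like this. This is only a sketch on a hypothetical tiny frame standing in for df_original; the real data would come from the CSV import:

```python
import pandas as pd

# Hypothetical stand-in for df_original (the real one has 10M rows).
df = pd.DataFrame({
    "industry": ["Retail", "Tech", "Retail", "Tech"],
    "months": ["Jan", "Feb", "Jan", "Feb"],
    "client_name": ["A", "B", "A", "C"],
    "orders": [3, 5, 2, 7],
})

# Repetitive string columns become categorical: less memory,
# faster groupby/pivot operations on large data.
for col in ["industry", "months", "client_name"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```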
I tried pandas.DataFrame.unstack
, which was even slower. I also experimented with Dask
. The Dask pivot_table
yielded some improvement (6 s execution time, so 2 s less), but it is still pretty slow.
Are there any faster alternatives for large datasets? Maybe recreating the pivot table with groupby
, crosstab
, or something similar? Unfortunately, I could not get these alternatives to work at all, and I am still quite new to Python and pandas.
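For reference, here is what a crosstab version might look like, sketched on a hypothetical tiny frame. Note that pd.crosstab takes only one values/aggfunc pair per call, so the two aggregations need two calls:

```python
import pandas as pd

# Hypothetical stand-in for df_original.
df = pd.DataFrame({
    "industry": ["Retail", "Retail", "Tech"],
    "months": ["Jan", "Jan", "Feb"],
    "client_name": ["A", "B", "A"],
    "orders": [3, 2, 5],
})

# One crosstab per aggregation: summed orders and unique client counts,
# both with industry as rows and months as columns.
orders = pd.crosstab(df["industry"], df["months"],
                     values=df["orders"], aggfunc="sum")
clients = pd.crosstab(df["industry"], df["months"],
                      values=df["client_name"], aggfunc="nunique")
```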
Looking forward to your suggestions. Thanks in advance!
Update:
I figured out the groupby approach:
df_new = (
    df_original
    .groupby(["months", "industry"])
    .agg({"orders": np.sum, "client_name": pd.Series.nunique})
    .unstack(level="months")
    .fillna(0)
)
This is much faster now, at about 2-3 seconds. Are there still options to improve speed further?
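One thing I might still try (a sketch on a hypothetical tiny frame, not yet benchmarked on the real data): string aggregation names like "sum" and "nunique" let pandas use its fast built-in paths instead of the Python-level pd.Series.nunique, and observed=True skips empty category combinations when grouping on categorical columns:

```python
import pandas as pd

# Hypothetical stand-in for df_original.
df_original = pd.DataFrame({
    "industry": ["Retail", "Tech", "Retail"],
    "months": ["Jan", "Jan", "Feb"],
    "client_name": ["A", "B", "A"],
    "orders": [3, 5, 2],
})

df_new = (
    df_original
    # observed=True only emits category combinations actually present.
    .groupby(["months", "industry"], observed=True)
    # Named aggregation with built-in string funcs ("sum", "nunique").
    .agg(orders=("orders", "sum"), clients=("client_name", "nunique"))
    # fill_value=0 replaces the separate .fillna(0) step.
    .unstack(level="months", fill_value=0)
)
```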