Faster alternatives to Pandas pivot_table

2024/10/2 3:18:07

I'm using the Pandas pivot_table function on a large dataset (10 million rows, 6 columns). Execution time is paramount, so I am trying to speed up the process. Currently it takes around 8 seconds to process the whole dataset, which is way too slow, and I hope to find alternatives that improve performance.

My current Pandas pivot_table:

df_pivot = df_original.pivot_table(index="industry", columns="months", values=["orders", "client_name"], aggfunc={"orders": np.sum, "client_name": pd.Series.nunique})

df_original includes all the data (10 million rows, imported from a CSV file). industry is the client's industry, months are the order months (Jan to Dec), and orders is the number of orders. All columns were converted to the categorical dtype, except orders (int dtype). Originally industry, months and client_name were strings.

I tried using pandas.DataFrame.unstack, which was even slower. I also experimented with Dask. The Dask pivot_table yielded some improvement (6 seconds execution time, so 2 seconds less), but it is still pretty slow. Are there any faster alternatives for large datasets? Maybe recreating the pivot table with groupby, crosstab, ...? Unfortunately, I did not get the alternatives to work at all, and I am still quite new to Python and Pandas. Looking forward to your suggestions. Thanks in advance!

Update:

I figured out the groupby way with:

df_new = df_original.groupby(["months", "industry"]).agg({"orders": np.sum, "client_name": pd.Series.nunique}).unstack(level="months").fillna(0)

This is much faster now, at about 2-3 seconds. Are there any options to improve speed further?
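Before swapping one approach for the other on the full data, it may be worth confirming on a small sample that the groupby/unstack version produces the same table as pivot_table. The following is a minimal sketch; the sample frame and the names sample, aggs, via_pivot and via_groupby are made up for illustration, and the string aggregators "sum" and "nunique" are used in place of np.sum and pd.Series.nunique.

```python
import pandas as pd

# Tiny made-up sample with the same column names as df_original.
sample = pd.DataFrame({
    "industry": ["retail", "retail", "tech", "tech", "tech", "retail"],
    "months": ["Jan", "Feb", "Jan", "Jan", "Feb", "Jan"],
    "client_name": ["a", "b", "c", "c", "d", "b"],
    "orders": [3, 1, 4, 2, 5, 7],
})

# Same aggregations as in the question, written as strings.
aggs = {"orders": "sum", "client_name": "nunique"}

via_pivot = sample.pivot_table(
    index="industry",
    columns="months",
    values=["orders", "client_name"],
    aggfunc=aggs,
)

via_groupby = (
    sample.groupby(["months", "industry"])
    .agg(aggs)
    .unstack(level="months")
    .fillna(0)
)

# Put the columns in the same order before comparing the values.
via_pivot = via_pivot.sort_index(axis=1)
via_groupby = via_groupby.sort_index(axis=1)
same_values = bool((via_pivot.to_numpy() == via_groupby.to_numpy()).all())
print(same_values)
```

If the comparison holds on a representative sample, the faster groupby version can stand in for pivot_table.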

Answer

Convert the months and industry columns to categorical columns (see https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html). This way the grouping works on integer category codes and avoids a lot of string comparisons.
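A minimal sketch of that conversion combined with the groupby approach from the question, assuming a small made-up frame in place of the real df_original. The output column names orders and clients are chosen here for illustration. Passing observed=True is an addition not mentioned above: with categorical group keys it asks pandas to aggregate only the category combinations that actually occur, which may help when many combinations are empty.

```python
import pandas as pd

# Small made-up stand-in for df_original.
df_original = pd.DataFrame({
    "industry": ["retail", "retail", "tech", "tech", "tech"],
    "months": ["Jan", "Feb", "Jan", "Jan", "Feb"],
    "client_name": ["a", "b", "c", "c", "d"],
    "orders": [3, 1, 4, 2, 5],
})

# Convert the string columns to the categorical dtype.
for col in ["industry", "months", "client_name"]:
    df_original[col] = df_original[col].astype("category")

df_new = (
    df_original
    # observed=True: only group on category combinations present in the data.
    .groupby(["industry", "months"], observed=True)
    .agg(orders=("orders", "sum"), clients=("client_name", "nunique"))
    # Move months into the columns; fill missing combinations with 0.
    .unstack("months", fill_value=0)
)
print(df_new)
```

Whether this is faster than the 2-3 seconds reported in the question depends on the data, so it is worth timing both variants on the real dataset.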

