Compute on pandas dataframe concurrently

2024/9/23 14:24:57

Is it feasible to do multiple group-wise calculation in dataframe in pandas concurrently and get those results back? So, I'd like to compute the following sets of dataframe and get those results one-by-one, and finally merge them into one dataframe.

df_a = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["height"]))
df_b = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["weight"]))
df_c = df.groupby(["state", "person"]).apply(lambda x: xp["number"].sum())

And then,

df_final = merge(df_a, df_b) # omitting the irrelevant part

However, as far as I know, those functionalities at multiprocessing don't fit my needs here, since it looks more like concurrently run multiple functions that don't return the internally-created, local variables, and instead just print some output within the function (e.g. oft-used is_prime function), or concurrently run a single function with different sets of arguments (e.g. map function in multiprocessing), if I understand it correctly (I'm not sure I understand it correctly, so correct me if I'm wrong!).

However, what I'd like to implement is just run those three (and actually, more) simultaneously and finally merge them together, once all of those computation on dataframe are completed successfully. I assume the kind of functionalities implemented in Go (goroutines and channels), by perhaps creating each function respectively, running them one-by-one, concurrently, waiting for all of them completed, and finally merging them together.

So how can it be written in Python? I read the documentation of multiprocessing, threading, and concurrent.futures, but all of them are too elusive for me, that I don't even understand whether I can utilize those libraries to begin with...

(I made the code precise for the purpose of brevity and the actual code is more complicated, so please don't answer "Yeah, you can write it in one line and in non-concurrent way" or something like that.)

Thanks.

Answer

9 Months later and this is still one of the top results for working with multiprocessing and pandas. I hope you've found some type of answer at this point, but if not I've got one that seems to work and hopefully it will help others who view this question.

import pandas as pd
import numpy as np
#sample data
df = pd.DataFrame([[1,2,3,1,2,3,1,2,3,1],[2,2,2,2,2,2,2,2,2,2],[1,3,5,7,9,2,4,6,8,0],[2,4,6,8,0,1,3,5,7,9]]).transpose()
df.columns=['a','b','c','d']
dfa  b  c  d
0  1  2  1  2
1  2  2  3  4
2  3  2  5  6
3  1  2  7  8
4  2  2  9  0
5  3  2  2  1
6  1  2  4  3
7  2  2  6  5
8  3  2  8  7
9  1  2  0  9#this one function does the three functions you had used in your question, obviously you could add more functions or different ones for different groupby things
def f(x):return [np.mean(x[1]['c']),np.mean(x[1]['d']),x[1]['d'].sum()]#sets up a pool with 4 cpus
from multiprocessing import Pool
pool = Pool(4)#runs the statistics you wanted on each group
group_df = pd.DataFrame(pool.map(f,df.groupby(['a','b'])))
group_df0         1   2
0  3  5.500000  22
1  6  3.000000   9
2  5  4.666667  14group_df['keys']=df.groupby(['a','b']).groups.keys()group_df0         1   2    keys
0  3  5.500000  22  (1, 2)
1  6  3.000000   9  (3, 2)
2  5  4.666667  14  (2, 2)

At the least I hope this helps someone who's looking at this stuff in the future

https://en.xdnf.cn/q/71814.html

Related Q&A

How do I go about writing a program to send and receive sms using python?

I have looked all over the net for a good library to use in sending and receiving smss using python but all in vain!Are there GSM libraries for python out there?

Persist Completed Pipeline in Luigi Visualiser

Im starting to port a nightly data pipeline from a visual ETL tool to Luigi, and I really enjoy that there is a visualiser to see the status of jobs. However, Ive noticed that a few minutes after the l…

How to assign python requests sessions for single processes in multiprocessing pool?

Considering the following code example:import multiprocessing import requestssession = requests.Session() data_to_be_processed = [...]def process(arg):# do stuff with arg and get urlresponse = session.…

Missing values in Pandas Pivot table?

I have a data set that looks like the following:student question answer number Bob How many donuts in a dozen? A 1 Sally How many donuts in a do…

Selecting Element followed by text with Selenium WebDriver

I am using Selenium WebDriver and the Python bindings to automate some monotonous WordPress tasks, and it has been pretty straightforward up until this point. I am trying to select a checkbox, but the …

AttributeError: module keras.backend has no attribute image_dim_ordering

I tried to execute some tutorial transfer learning project. But Ive got attribute error.I checked my tensorflow and keras version.tensorflow : 1.14.0 keras : 2.2.5and python 3.6.9 version.the code is h…

Python Interpreter String Pooling Optimization [duplicate]

This question already has answers here:What determines which strings are interned and when? [duplicate](3 answers)Closed 6 years ago.After seeing this question and its duplicate a question still remai…

Flattening an array in pandas

One of the columns in DataFrame is an array. How do I flatten it? column1 column2 column3 var1 var11 [1, 2, 3, 4] var2 var22 [1, 2, 3, 4, -2, 12] var3 var33 [1, 2, 3, 4, 33, 544]Afte…

Difficulty in using sympy solver in python

Please run the following codefrom sympy.solvers import solvefrom sympy import Symbolx = Symbol(x)R2 = solve(-109*x**5/3870720+4157*x**4/1935360-3607*x**3/69120+23069*x**2/60480+5491*x/2520+38-67,x)prin…

Add custom html between two model fields in Django admins change_form

Lets say Ive two models:class Book(models.Model):name = models.CharField(max_length=50)library = models.ForeignKeyField(Library)class Library(models.Model):name = models.CharField(max_length=50) addr…