Is it feasible to run multiple group-wise calculations on a pandas DataFrame concurrently and get the results back? I'd like to compute the following DataFrames, collect each result, and finally merge them into one DataFrame.
df_a = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["height"]))
df_b = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["weight"]))
df_c = df.groupby(["state", "person"]).apply(lambda x: x["number"].sum())
And then,
df_final = merge(df_a, df_b) # omitting the irrelevant part
However, as far as I know, the functionality in multiprocessing doesn't fit my needs here. It seems geared toward either concurrently running multiple functions that don't return their internally created local variables and instead just print some output inside the function (e.g. the oft-used is_prime function), or concurrently running a single function with different sets of arguments (e.g. the map function in multiprocessing). That's if I understand it correctly; I'm not sure I do, so correct me if I'm wrong!
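For reference, the map-style usage I'm referring to looks roughly like the sketch below. The is_prime body and the input list are only illustrative, and I'm assuming multiprocessing.pool.ThreadPool as a stand-in because it exposes the same map interface as Pool while running without a `__main__` guard.

```python
from multiprocessing.pool import ThreadPool


def is_prime(n):
    """Return True if n is prime (simple trial division, illustrative only)."""
    if n < 2:
        return False
    for divisor in range(2, int(n ** 0.5) + 1):
        if n % divisor == 0:
            return False
    return True


# Hypothetical inputs: one function, many different arguments.
numbers = [2, 15, 17, 21, 23]

with ThreadPool(processes=4) as pool:
    # map() applies is_prime to each number and collects the return
    # values into a list, in the same order as the inputs.
    flags = pool.map(is_prime, numbers)
```

This covers the "one function, many arguments" case; what I'm after is the opposite shape, with several different functions applied to the same DataFrame.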
However, what I'd like to do is run those three computations (and actually, more) simultaneously and merge the results once all of them have completed successfully. I'm thinking of something like what Go offers (goroutines and channels): define each computation as its own function, launch them all concurrently, wait for all of them to complete, and finally merge the results.
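To make the goal concrete, here is a rough sketch of the shape I have in mind, not a confirmed solution. The sample data, the KEYS constant, and the three function names are all hypothetical. I'm assuming concurrent.futures.ThreadPoolExecutor so that the sketch runs anywhere; ProcessPoolExecutor has the same submit/result interface, but as far as I can tell it needs top-level functions and a `__main__` guard.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical sample data standing in for the real df.
df = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY"],
    "person": ["ann", "ann", "bob", "bob"],
    "height": [160.0, 170.0, 180.0, 190.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
    "number": [1, 2, 3, 4],
})

KEYS = ["state", "person"]


# One function per group-wise computation; each returns a named Series
# indexed by (state, person).
def mean_height(frame):
    return frame.groupby(KEYS)["height"].mean().rename("mean_height")


def mean_weight(frame):
    return frame.groupby(KEYS)["weight"].mean().rename("mean_weight")


def total_number(frame):
    return frame.groupby(KEYS)["number"].sum().rename("total_number")


tasks = [mean_height, mean_weight, total_number]

with ThreadPoolExecutor(max_workers=len(tasks)) as executor:
    # submit() schedules each task and returns a Future immediately.
    futures = [executor.submit(task, df) for task in tasks]
    # result() blocks until that task finishes and re-raises any exception,
    # so this line waits for all of the computations.
    results = [future.result() for future in futures]

# All three Series share the same (state, person) index, so concat aligns them.
df_final = pd.concat(results, axis=1).reset_index()
```

Whether threads actually speed this up presumably depends on how much of the pandas work releases the GIL, so for CPU-bound group-wise functions the process-based executor may be the one that matters.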
So how can this be written in Python? I read the documentation for multiprocessing, threading, and concurrent.futures, but all of them are too opaque for me, to the point that I can't even tell whether I can use those libraries for this to begin with...
(I simplified the code for brevity and the actual code is more complicated, so please don't answer "Yeah, you can write it in one line in a non-concurrent way" or something like that.)
Thanks.