Optimizing a Cartesian product between two Pandas DataFrames


I have two dataframes with the same columns:

Dataframe 1:

          attr_1  attr_77 ... attr_8
userID                              
John      1.2501  2.4196  ... 1.7610
Charles   0.0000  1.0618  ... 1.4813
Genarito  2.7037  4.6707  ... 5.3583
Mark      9.2775  6.7638  ... 6.0071

Dataframe 2:

          attr_1  attr_77 ... attr_8
petID                              
Firulais  1.2501  2.4196  ... 1.7610
Connie    0.0000  1.0618  ... 1.4813
PopCorn   2.7037  4.6707  ... 5.3583

I want to generate a DataFrame with the correlation and p-value of every possible combination; this would be the result:

   userID   petID      Correlation    p-value
0  John     Firulais   0.091447       1.222927e-02
1  John     Connie     0.101687       5.313359e-03
2  John     PopCorn    0.178965       8.103919e-07
3  Charles  Firulais   -0.078460      3.167896e-02

The problem is that the Cartesian product generates more than 3 million tuples, which takes minutes to finish. This is my code; I've written two alternatives:

First of all, initial DataFrames:

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

df1 = pd.DataFrame({'userID': ['John', 'Charles', 'Genarito', 'Mark'],
                    'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
                    'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
                    'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]}).set_index('userID')

df2 = pd.DataFrame({'petID': ['Firulais', 'Connie', 'PopCorn'],
                    'attr_1': [1.2501, 0.0, 2.7037],
                    'attr_77': [2.4196, 1.0618, 4.6707],
                    'attr_8': [1.7610, 1.4813, 5.3583]}).set_index('petID')
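To make the target concrete, each row of the desired output corresponds to a single scipy.stats.pearsonr call over the shared attribute columns. A minimal sketch with the toy frames above (note that the real data has many more attributes, so these toy values are degenerate):

from scipy.stats import pearsonr

# One (user, pet) pair -> one Correlation/p-value row of the result
r, p = pearsonr(df1.loc['John'].values, df2.loc['Firulais'].values)
print(r, p)  # the toy rows are identical, so r == 1.0 here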

Option 1:

# Pre-allocate space
df1_keys = df1.index
res_row_count = len(df1_keys) * df2.values.shape[0]
users = np.empty(res_row_count, dtype='object')
pets = np.empty(res_row_count, dtype='object')
coff = np.empty(res_row_count)
p_value = np.empty(res_row_count)

i = 0
for df1_key in df1_keys:
    df1_values = df1.loc[df1_key, :].values
    for df2_key in df2.index:
        df2_values = df2.loc[df2_key, :]
        pearson_res = pearsonr(df1_values, df2_values)
        users[i] = df1_key
        pets[i] = df2_key
        coff[i] = pearson_res[0]
        p_value[i] = pearson_res[1]
        i += 1

# After the loop, build the resulting DataFrame
return pd.DataFrame(data={'userID': users,
                          'petID': pets,
                          'Correlation': coff,
                          'p-value': p_value})
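As an aside (my sketch, not part of the original question): a noticeable share of the cost inside this loop comes from the repeated label-based .loc lookups, which can be hoisted out by converting both frames to NumPy arrays once:

from scipy.stats import pearsonr

# Hypothetical micro-optimization: index plain NumPy arrays in the hot loop
a1 = df1.to_numpy()   # shape (n_users, n_attrs)
a2 = df2.to_numpy()   # shape (n_pets, n_attrs)

rows = []
for i, user in enumerate(df1.index):
    for j, pet in enumerate(df2.index):
        r, p = pearsonr(a1[i], a2[j])
        rows.append((user, pet, r, p))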

Option 2 (slower), from here:

# Makes a merge between all the tuples
def df_crossjoin(df1_file_path, df2_file_path):
    df1, df2 = prepare_df(df1_file_path, df2_file_path)
    df1['_tmpkey'] = 1
    df2['_tmpkey'] = 1
    res = pd.merge(df1, df2, on='_tmpkey').drop('_tmpkey', axis=1)
    res.index = pd.MultiIndex.from_product((df1.index, df2.index))
    df1.drop('_tmpkey', axis=1, inplace=True)
    df2.drop('_tmpkey', axis=1, inplace=True)
    return res

# Computes the Pearson coefficient for all the tuples
def compute_pearson(row):
    values = np.split(row.values, 2)
    return pearsonr(values[0], values[1])

result = df_crossjoin(mrna_file, mirna_file).apply(compute_pearson, axis=1)
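As a side note (my addition, not part of the original post): on pandas 1.2 or newer, the temporary-key trick can be replaced by the built-in cross join, which should behave the same for this purpose:

# pandas >= 1.2 supports how='cross', making the _tmpkey pattern unnecessary
res = pd.merge(df1.reset_index(), df2.reset_index(), how='cross')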

Is there a faster way to solve this problem with Pandas? Or will I have no option but to parallelize the iterations?

Edit:

As the size of the DataFrames increases, the second option results in a better runtime, but it still takes seconds to finish.

Thanks in advance

Answer

Of all the alternatives tested, the one that gave me the best results was the following:

  1. The iteration product was built with itertools.product() (see the sketch after this list).

  2. All the iterations over both iterrows() generators were performed on a Pool of parallel processes (using a map function).
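For orientation, a minimal sketch of what each element produced by itertools.product over the two iterrows() generators looks like (using the toy df1/df2 defined above):

import itertools

# Each element pairs one row from each frame:
# ((df1_index_label, df1_row_Series), (df2_index_label, df2_row_Series))
first = next(itertools.product(df1.iterrows(), df2.iterrows()))
(user, user_row), (pet, pet_row) = first
print(user, pet)  # -> John Firulais with the sample data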

To squeeze out a little more performance, the compute_row_cython function was compiled with Cython, as advised in this section of the Pandas documentation:

In the cython_modules.pyx file:

from scipy.stats import pearsonr
import numpy as np

def compute_row_cython(row):
    (df1_key, df1_values), (df2_key, df2_values) = row
    cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
    return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]

Then I set up the setup.py:

from distutils.core import setup
from Cython.Build import cythonize

setup(name='Compiled Pearson',
      ext_modules=cythonize("cython_modules.pyx"))

Finally, I compiled it with: python setup.py build_ext --inplace

The final code, then, was the following:

import itertools
import multiprocessing

from cython_modules import compute_row_cython

NUM_CORES = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(NUM_CORES)

# Calls the Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows())))
pool.close()
end_values = list(res)
pool.join()
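The original snippet ends here; for completeness, a minimal sketch of how the transposed results could be assembled into the final DataFrame (this assembly step is my assumption, it is not shown in the answer):

import pandas as pd

# zip(*...) transposes the list of (user, pet, r, p) tuples into four sequences
users, pets, correlations, p_values = end_values
result = pd.DataFrame({'userID': users,
                       'petID': pets,
                       'Correlation': correlations,
                       'p-value': p_values})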

Neither Dask nor the merge-with-apply approach gave me better results, not even after optimizing the apply with Cython. In fact, those two methods gave me memory errors; when implementing the solution with Dask I had to generate several partitions, which degraded performance because of the many I/O operations involved.

The solution with Dask can be found in my other question.

https://en.xdnf.cn/q/72664.html
