Optimizing a Cartesian product between two Pandas DataFrames


I have two dataframes with the same columns:

Dataframe 1:

          attr_1  attr_77 ... attr_8
userID                              
John      1.2501  2.4196  ... 1.7610
Charles   0.0000  1.0618  ... 1.4813
Genarito  2.7037  4.6707  ... 5.3583
Mark      9.2775  6.7638  ... 6.0071

Dataframe 2:

          attr_1  attr_77 ... attr_8
petID                              
Firulais  1.2501  2.4196  ... 1.7610
Connie    0.0000  1.0618  ... 1.4813
PopCorn   2.7037  4.6707  ... 5.3583

I want to generate a DataFrame with the correlation and p-value of every possible combination; this would be the result:

   userID   petID      Correlation    p-value
0  John     Firulais   0.091447       1.222927e-02
1  John     Connie     0.101687       5.313359e-03
2  John     PopCorn    0.178965       8.103919e-07
3  Charles  Firulais   -0.078460      3.167896e-02

The problem is that the Cartesian product generates more than 3 million tuples, which takes minutes to finish. This is my code; I've written two alternatives:

First of all, initial DataFrames:

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

df1 = pd.DataFrame({'userID': ['John', 'Charles', 'Genarito', 'Mark'],
                    'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
                    'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
                    'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]}).set_index('userID')

df2 = pd.DataFrame({'petID': ['Firulais', 'Connie', 'PopCorn'],
                    'attr_1': [1.2501, 0.0, 2.7037],
                    'attr_77': [2.4196, 1.0618, 4.6707],
                    'attr_8': [1.7610, 1.4813, 5.3583]}).set_index('petID')
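To make the target concrete, each row of the desired output corresponds to a single scipy.stats.pearsonr call over the shared attribute columns. A minimal sketch with the toy frames above (note that the real data has many more attributes, so these toy values are degenerate):

from scipy.stats import pearsonr

# One (user, pet) pair -> one Correlation/p-value row of the result
r, p = pearsonr(df1.loc['John'].values, df2.loc['Firulais'].values)
print(r, p)  # the toy rows are identical, so r == 1.0 here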

Option 1:

# Pre-allocate space
df1_keys = df1.index
res_row_count = len(df1_keys) * df2.values.shape[0]
users = np.empty(res_row_count, dtype='object')
pets = np.empty(res_row_count, dtype='object')
coff = np.empty(res_row_count)
p_value = np.empty(res_row_count)

i = 0
for df1_key in df1_keys:
    df1_values = df1.loc[df1_key, :].values
    for df2_key in df2.index:
        df2_values = df2.loc[df2_key, :]
        pearson_res = pearsonr(df1_values, df2_values)
        users[i] = df1_key
        pets[i] = df2_key
        coff[i] = pearson_res[0]
        p_value[i] = pearson_res[1]
        i += 1

# After the loop, build the resulting DataFrame
return pd.DataFrame(data={'userID': users,
                          'petID': pets,
                          'Correlation': coff,
                          'p-value': p_value})
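As an aside (my sketch, not part of the original question): a noticeable share of the cost inside this loop comes from the repeated label-based .loc lookups, which can be hoisted out by converting both frames to NumPy arrays once:

from scipy.stats import pearsonr

# Hypothetical micro-optimization: index plain NumPy arrays in the hot loop
a1 = df1.to_numpy()   # shape (n_users, n_attrs)
a2 = df2.to_numpy()   # shape (n_pets, n_attrs)

rows = []
for i, user in enumerate(df1.index):
    for j, pet in enumerate(df2.index):
        r, p = pearsonr(a1[i], a2[j])
        rows.append((user, pet, r, p))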

Option 2 (slower), from here:

# Makes a merge between all the tuples
def df_crossjoin(df1_file_path, df2_file_path):
    df1, df2 = prepare_df(df1_file_path, df2_file_path)
    df1['_tmpkey'] = 1
    df2['_tmpkey'] = 1
    res = pd.merge(df1, df2, on='_tmpkey').drop('_tmpkey', axis=1)
    res.index = pd.MultiIndex.from_product((df1.index, df2.index))
    df1.drop('_tmpkey', axis=1, inplace=True)
    df2.drop('_tmpkey', axis=1, inplace=True)
    return res

# Computes the Pearson coefficient for all the tuples
def compute_pearson(row):
    values = np.split(row.values, 2)
    return pearsonr(values[0], values[1])

result = df_crossjoin(mrna_file, mirna_file).apply(compute_pearson, axis=1)
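As a side note (my addition, not part of the original post): on pandas 1.2 or newer, the temporary-key trick can be replaced by the built-in cross join, which should behave the same for this purpose:

# pandas >= 1.2 supports how='cross', making the _tmpkey pattern unnecessary
res = pd.merge(df1.reset_index(), df2.reset_index(), how='cross')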

Is there a faster way to solve this problem with Pandas? Or will I have no option but to parallelize the iterations?

Edit:

As the size of the DataFrames increases, the second option results in a better runtime, but it still takes seconds to finish.

Thanks in advance

Answer

Of all the alternatives tested, the one that gave me the best results was the following:

  1. The iteration product was built with itertools.product() (see the sketch after this list).

  2. All the iterations over both iterrows() generators were performed on a Pool of parallel processes (using a map function).
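For orientation, a minimal sketch of what each element produced by itertools.product over the two iterrows() generators looks like (using the toy df1/df2 defined above):

import itertools

# Each element pairs one row from each frame:
# ((df1_index_label, df1_row_Series), (df2_index_label, df2_row_Series))
first = next(itertools.product(df1.iterrows(), df2.iterrows()))
(user, user_row), (pet, pet_row) = first
print(user, pet)  # -> John Firulais with the sample data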

To squeeze out a little more performance, the compute_row_cython function was compiled with Cython, as advised in this section of the Pandas documentation:

In the cython_modules.pyx file:

from scipy.stats import pearsonr
import numpy as np

def compute_row_cython(row):
    (df1_key, df1_values), (df2_key, df2_values) = row
    cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
    return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]

Then I set up the setup.py:

from distutils.core import setup
from Cython.Build import cythonize

setup(name='Compiled Pearson',
      ext_modules=cythonize("cython_modules.pyx"))

Finally, I compiled it with: python setup.py build_ext --inplace

The final code, then, was the following:

import itertools
import multiprocessing

from cython_modules import compute_row_cython

NUM_CORES = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(NUM_CORES)

# Calls the Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows())))
pool.close()
end_values = list(res)
pool.join()
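The original snippet ends here; for completeness, a minimal sketch of how the transposed results could be assembled into the final DataFrame (this assembly step is my assumption, it is not shown in the answer):

import pandas as pd

# zip(*...) transposes the list of (user, pet, r, p) tuples into four sequences
users, pets, correlations, p_values = end_values
result = pd.DataFrame({'userID': users,
                       'petID': pets,
                       'Correlation': correlations,
                       'p-value': p_values})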

Neither Dask nor the merge-with-apply approach gave me better results, not even after optimizing the apply with Cython. In fact, those two methods gave me memory errors; when implementing the solution with Dask I had to generate several partitions, which degraded performance because of the many I/O operations involved.

The solution with Dask can be found in my other question.

https://en.xdnf.cn/q/72664.html
