I have two dataframes with the same columns:
Dataframe 1:
attr_1 attr_77 ... attr_8
userID
John 1.2501 2.4196 ... 1.7610
Charles 0.0000 1.0618 ... 1.4813
Genarito 2.7037 4.6707 ... 5.3583
Mark 9.2775 6.7638 ... 6.0071
Dataframe 2:
attr_1 attr_77 ... attr_8
petID
Firulais 1.2501 2.4196 ... 1.7610
Connie 0.0000 1.0618 ... 1.4813
PopCorn 2.7037 4.6707 ... 5.3583
I want to generate a DataFrame of correlations and p-values for all possible combinations; this would be the result:
userID petID Correlation p-value
0 John Firulais 0.091447 1.222927e-02
1 John Connie 0.101687 5.313359e-03
2 John PopCorn 0.178965 8.103919e-07
3 Charles Firulais -0.078460 3.167896e-02
The problem is that the Cartesian product generates more than 3 million tuples, which takes minutes to finish. This is my code; I've written two alternatives:
First of all, initial DataFrames:
df1 = pd.DataFrame({'userID': ['John', 'Charles', 'Genarito', 'Mark'],
                    'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
                    'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
                    'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]
                    }).set_index('userID')

df2 = pd.DataFrame({'petID': ['Firulais', 'Connie', 'PopCorn'],
                    'attr_1': [1.2501, 0.0, 2.7037],
                    'attr_77': [2.4196, 1.0618, 4.6707],
                    'attr_8': [1.7610, 1.4813, 5.3583]
                    }).set_index('petID')
Option 1:
# Pre-allocate space
df1_keys = df1.index
res_row_count = len(df1_keys) * df2.values.shape[0]
users = np.empty(res_row_count, dtype='object')
pets = np.empty(res_row_count, dtype='object')
coff = np.empty(res_row_count)
p_value = np.empty(res_row_count)

i = 0
for df1_key in df1_keys:
    df1_values = df1.loc[df1_key, :].values
    for df2_key in df2.index:
        df2_values = df2.loc[df2_key, :].values
        pearson_res = pearsonr(df1_values, df2_values)
        users[i] = df1_key
        pets[i] = df2_key
        coff[i] = pearson_res[0]
        p_value[i] = pearson_res[1]
        i += 1

# After the loop, create the resulting DataFrame
return pd.DataFrame(data={'userID': users,
                          'petID': pets,
                          'Correlation': coff,
                          'p-value': p_value})
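As a side note, here is a variant of this option I tried (a sketch; `correlate_all` is just a name I made up) that pulls the raw NumPy arrays out once with `to_numpy()`, so the inner loop avoids the repeated `.loc` lookups:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def correlate_all(df1, df2):
    """Same pairwise pearsonr loop, but over raw NumPy rows
    instead of per-iteration .loc lookups."""
    a = df1.to_numpy()
    b = df2.to_numpy()
    rows = []
    for i, user in enumerate(df1.index):
        for j, pet in enumerate(df2.index):
            r, pval = pearsonr(a[i], b[j])
            rows.append((user, pet, r, pval))
    return pd.DataFrame(rows, columns=['userID', 'petID',
                                       'Correlation', 'p-value'])
```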
Option 2 (slower), from here:
# Makes a merge between all the tuples
def df_crossjoin(df1_file_path, df2_file_path):
    df1, df2 = prepare_df(df1_file_path, df2_file_path)
    df1['_tmpkey'] = 1
    df2['_tmpkey'] = 1
    res = pd.merge(df1, df2, on='_tmpkey').drop('_tmpkey', axis=1)
    res.index = pd.MultiIndex.from_product((df1.index, df2.index))
    df1.drop('_tmpkey', axis=1, inplace=True)
    df2.drop('_tmpkey', axis=1, inplace=True)
    return res

# Computes the Pearson coefficient for all the tuples
def compute_pearson(row):
    values = np.split(row.values, 2)
    return pearsonr(values[0], values[1])

result = df_crossjoin(mrna_file, mirna_file).apply(compute_pearson, axis=1)
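For reference, on pandas >= 1.2 the same cross join can be written without the `_tmpkey` trick, using `how='cross'` (a minimal sketch with made-up sample data, not my real files):

```python
import pandas as pd

df1 = pd.DataFrame({'attr_1': [1.0, 2.0]},
                   index=pd.Index(['John', 'Charles'], name='userID'))
df2 = pd.DataFrame({'attr_1': [3.0, 4.0]},
                   index=pd.Index(['Firulais', 'Connie'], name='petID'))

# how='cross' builds the Cartesian product directly (pandas >= 1.2),
# replacing the _tmpkey merge above
cross = pd.merge(df1.reset_index(), df2.reset_index(),
                 how='cross', suffixes=('_user', '_pet'))
```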
Is there a faster way to solve such a problem with pandas? Or will I have no option but to parallelize the iterations?
Edit:
As the size of the dataframes increases, the second option achieves a better runtime, but it still takes several seconds to finish.
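One direction I'm considering (a sketch, not benchmarked; `pearson_all_pairs` is a name I made up) is computing all coefficients at once with a single matrix product over row-standardized data, and deriving the two-sided p-values from the t distribution with n - 2 degrees of freedom, which is mathematically equivalent to what `scipy.stats.pearsonr` reports:

```python
import numpy as np
from scipy import stats

def pearson_all_pairs(A, B):
    """Correlate every row of A (m, k) with every row of B (p, k).

    Returns the correlation matrix r and the two-sided p-values,
    both of shape (m, p).
    """
    k = A.shape[1]
    # Center and L2-normalize each row, so the full correlation
    # matrix reduces to a single matrix product
    Az = A - A.mean(axis=1, keepdims=True)
    Bz = B - B.mean(axis=1, keepdims=True)
    Az /= np.linalg.norm(Az, axis=1, keepdims=True)
    Bz /= np.linalg.norm(Bz, axis=1, keepdims=True)
    r = Az @ Bz.T
    # Two-sided p-value from the t distribution with k - 2 dof;
    # clip r away from +/-1 to avoid division by zero
    rc = np.clip(r, -0.9999999, 0.9999999)
    t = rc * np.sqrt((k - 2) / (1 - rc ** 2))
    p = 2 * stats.t.sf(np.abs(t), k - 2)
    return r, p
```

This avoids calling `pearsonr` once per pair, which is where both of my options spend most of their time.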
Thanks in advance