I'm trying to figure out how to apply a lambda function across multiple dataframes simultaneously, without first merging the dataframes together. I am working with large data sets (>60MM records), so I need to be extra careful with memory management.
My hope is that there is a way to apply the lambda to just the underlying dataframes, so that I can avoid the cost of stitching them together first and then dropping that intermediary dataframe from memory before moving on to the next step in the process.
I have experience dodging out-of-memory issues by using HDF5-backed dataframes, but I'd rather explore something different first.
I have provided a toy problem to help demonstrate what I am talking about.
import numpy as np
import pandas as pd

# Here's an arbitrary function to use with lambda
def someFunction(input1, input2, input3, input4):
    theSum = input1 + input2
    theAverage = (input1 + input2 + input3 + input4) / 4
    theProduct = input2 * input3 * input4
    return pd.Series({'Sum': theSum, 'Average': theAverage, 'Product': theProduct})

# Cook up some dummy dataframes
df1 = pd.DataFrame(np.random.randn(6, 2), columns=list('AB'))
df2 = pd.DataFrame(np.random.randn(6, 1), columns=list('C'))
df3 = pd.DataFrame(np.random.randn(6, 1), columns=list('D'))

# Currently, I merge the dataframes together and then apply the lambda function.
# This works just fine, but merging the dataframes seems like an extra step.
dfConsolidated = pd.concat([df1, df2, df3], axis=1)
dfResults = dfConsolidated.apply(lambda x: someFunction(x['A'], x['B'], x['C'], x['D']), axis=1)

# I want to avoid the concat completely in order to be more efficient with memory.
# I am hoping for something like this.
# I am COMPLETELY making this syntax up for conceptual purposes, my apologies:
dfResultsWithoutConcat = [df1, df2, df3].apply(lambda x: someFunction(df1['A'], df1['B'], df2['C'], df3['D']), axis=1)
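For what it's worth, I can get part of the way there for this particular toy function by replacing the row-wise apply with plain vectorized arithmetic on the underlying Series, which never materializes a concatenated dataframe. I'm not sure this generalizes to my real functions, though (`dfResultsVectorized` is a name I made up for this sketch):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(6, 2), columns=list('AB'))
df2 = pd.DataFrame(np.random.randn(6, 1), columns=list('C'))
df3 = pd.DataFrame(np.random.randn(6, 1), columns=list('D'))

# Vectorized column arithmetic operates directly on the Series objects
# inside each dataframe, so no intermediate merged dataframe is created.
dfResultsVectorized = pd.DataFrame({
    'Sum': df1['A'] + df1['B'],
    'Average': (df1['A'] + df1['B'] + df2['C'] + df3['D']) / 4,
    'Product': df1['B'] * df2['C'] * df3['D'],
})
```

This only works because my example function is simple arithmetic; a function with branching per row would presumably still need some form of row-wise application.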