I'm trying to figure out how to apply a lambda function across multiple dataframes simultaneously, without first merging the dataframes together. I am working with large data sets (>60MM records), so I need to be extra careful with memory management.
My hope is that there is a way to apply the lambda to just the underlying dataframes, so that I can avoid the cost of stitching them together first and then dropping that intermediary dataframe from memory before moving on to the next step in the process.
I have experience dodging out-of-memory issues by using HDF5-backed dataframes, but I'd rather explore something different first.
I have provided a toy problem to help demonstrate what I am talking about.
import numpy as np
import pandas as pd

# Here's an arbitrary function to use with lambda
def someFunction(input1, input2, input3, input4):
    theSum = input1 + input2
    theAverage = (input1 + input2 + input3 + input4) / 4
    theProduct = input2 * input3 * input4
    return pd.Series({'Sum': theSum, 'Average': theAverage, 'Product': theProduct})

# Cook up some dummy dataframes
df1 = pd.DataFrame(np.random.randn(6, 2), columns=list('AB'))
df2 = pd.DataFrame(np.random.randn(6, 1), columns=list('C'))
df3 = pd.DataFrame(np.random.randn(6, 1), columns=list('D'))

# Currently, I merge the dataframes together and then apply the lambda function.
# This works just fine, but merging the dataframes seems like an extra step.
dfConsolidated = pd.concat([df1, df2, df3], axis=1)
dfResults = dfConsolidated.apply(lambda x: someFunction(x['A'], x['B'], x['C'], x['D']), axis=1)

# I want to avoid the concat completely in order to be more efficient with memory.
# I am hoping for something like this.
# I am COMPLETELY making this syntax up for conceptual purposes, my apologies:
dfResultsWithoutConcat = [df1, df2, df3].apply(lambda x: someFunction(df1['A'], df1['B'], df2['C'], df3['D']), axis=1)
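For what it's worth, I can get part of the way there for this particular toy function by replacing the row-wise apply with plain vectorized arithmetic on the underlying Series, which never materializes a concatenated dataframe. I'm not sure this generalizes to my real functions, though (`dfResultsVectorized` is a name I made up for this sketch):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(6, 2), columns=list('AB'))
df2 = pd.DataFrame(np.random.randn(6, 1), columns=list('C'))
df3 = pd.DataFrame(np.random.randn(6, 1), columns=list('D'))

# Vectorized column arithmetic operates directly on the Series objects
# inside each dataframe, so no intermediate merged dataframe is created.
dfResultsVectorized = pd.DataFrame({
    'Sum': df1['A'] + df1['B'],
    'Average': (df1['A'] + df1['B'] + df2['C'] + df3['D']) / 4,
    'Product': df1['B'] * df2['C'] * df3['D'],
})
```

This only works because my example function is simple arithmetic; a function with branching per row would presumably still need some form of row-wise application.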