Pandas: Applying Lambda to Multiple Data Frames

2024/10/14 15:25:53

I'm trying to figure out how to apply a lambda function to multiple dataframes simultaneously, without first merging the dataframes together. I am working with large data sets (>60MM records) and need to be extra careful with memory management.

My hope is that there is a way to apply lambda to just the underlying dataframes so that I can avoid the cost of stitching them together first, and then dropping that intermediary dataframe from memory before I move on to the next step in the process.

I have experience dodging out-of-memory issues by using HDF5-based dataframes, but I'd rather explore something different first.

I have provided a toy problem to help demonstrate what I am talking about.

import numpy as np
import pandas as pd

# Here's an arbitrary function to use with lambda
def someFunction(input1, input2, input3, input4):
    theSum = input1 + input2
    theAverage = (input1 + input2 + input3 + input4) / 4
    theProduct = input2 * input3 * input4
    return pd.Series({'Sum' : theSum, 'Average' : theAverage, 'Product' : theProduct})

# Cook up some dummy dataframes
df1 = pd.DataFrame(np.random.randn(6,2),columns=list('AB'))
df2 = pd.DataFrame(np.random.randn(6,1),columns=list('C'))
df3 = pd.DataFrame(np.random.randn(6,1),columns=list('D'))

# Currently, I merge the dataframes together and then apply the lambda function
dfConsolodated = pd.concat([df1, df2, df3], axis=1)

# This works just fine, but merging the dataframes seems like an extra step
dfResults = dfConsolodated.apply(lambda x: someFunction(x['A'], x['B'], x['C'], x['D']), axis = 1)

# I want to avoid the concat completely in order to be more efficient with memory. I am hoping for something like this:
# I am COMPLETELY making this syntax up for conceptual purposes, my apologies.
dfResultsWithoutConcat = [df1, df2, df3].apply(lambda x: someFunction(df1['A'], df1['B'], df2['C'], df3['D']), axis = 1)

Answer

I know this question is kind of old, but here is a way I came up with. It is not nice, but it works.

The basic idea is to query the second dataframe inside the applied function. By using the name of the passed series, you can identify the column/index and use it to retrieve the needed value from the other dataframe(s).

def func(x, other):
    other_value = other.loc[x.name]
    return your_actual_method(x, other_value)

result = df1.apply(lambda x: func(x, df2))
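
Applied to the toy problem from the question, the lookup idea might look roughly like the sketch below. It assumes all three dataframes share the same index, and it passes axis=1 so that the name of each passed series is the row label. The names some_function, row_func and the seeded random generator are made up for illustration. The last part is not from the answer above: because this particular function is plain arithmetic, the same result can also be computed column-wise without apply or concat, which is likely to be much cheaper on tens of millions of rows.

```python
import numpy as np
import pandas as pd

# Stand-in for the arbitrary function from the question (renamed for this sketch).
def some_function(a, b, c, d):
    return pd.Series({'Sum': a + b,
                      'Average': (a + b + c + d) / 4,
                      'Product': b * c * d})

# Dummy dataframes sharing the default 0..5 index (assumption: indexes are aligned).
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((6, 2)), columns=list('AB'))
df2 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('C'))
df3 = pd.DataFrame(rng.standard_normal((6, 1)), columns=list('D'))

# Hypothetical helper: with axis=1, row.name is the index label of the current
# row of df1, which is used to look up the matching rows in the other frames.
def row_func(row, other_c, other_d):
    c = other_c.loc[row.name, 'C']
    d = other_d.loc[row.name, 'D']
    return some_function(row['A'], row['B'], c, d)

# Row-wise apply on df1 only; df2 and df3 are queried inside the function.
result = df1.apply(lambda row: row_func(row, df2, df3), axis=1)

# Alternative (not from the answer): column-wise arithmetic on the separate
# frames, aligned on the index, with no apply and no concat of the inputs.
vectorized = pd.DataFrame({
    'Sum': df1['A'] + df1['B'],
    'Average': (df1['A'] + df1['B'] + df2['C'] + df3['D']) / 4,
    'Product': df1['B'] * df2['C'] * df3['D'],
})
```

The row-wise version still pays one Python call and two .loc lookups per row, so it saves the intermediate dataframe but not time. The column-wise version only works when the function can be written as operations on whole columns.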
