Pandas - Update/Merge 2 Dataframes based on multiple matching column values

2024/9/20 12:26:00

I have 2 dataframes left_df and right-df, which both have 20 columns with identical names and dtypes. right_df also has 2 additional columns with unique values on every row.

I want to update rows in right_df with ALL the values from left_df where the values in ALL columns in a list of a subset of columns, matching_cols = ['col_1', 'col_3', 'col_10', 'col_12'] are identical in both dataframes. The values in the additional 2 unique columns in right_df should be preserved.

Ideally, I want to also drop those rows from left_df in the same command, or as the next command if this isn't possible. I need to do this process more than once, matching on several different lists of columns, with the left_df dropping matched rows each loop, until eventually no further matches are found.

An acceptable alternative would be any method to create a new dataframe new_df containing the set of rows where all specified columns in the list matching_cols match, with values from left_df in the first 20 columns and values from right_df in the remaining 2 columns.

I don't care about preserving the indices at any point in either dataframe, I am importing them to SQL after this and will reindex them on one of the 2 right_df values at the end.

New to Pandas and can't determine what method to use, have tried variations of .merge, .join, .update, etc, but can't seem to specify to only update when my desired column values all match, or how to drop those rows/export them to a new df.

Update: Added pseudocode below:

For a left_df as:

left_df = pd.DataFrame({'col_0': ['0', '1', '2', '3', '4', '5'],'col_1': ['A', 'B', 'C', 'D', 'E', 'F'],'col_2': ['new', 'new', 'new', 'new', 'new', 'new'],'col_3': ['new', 'new', 'new', 'new', 'new', 'new'],'col_4': ['new', 'new', 'new', 'new', 'new', 'new'],'col_5': ['new', 'new', 'new', 'new', 'new', 'new'],'col_6': ['new', 'new', 'new', 'new', 'new', 'new'],'col_7': ['new', 'new', 'new', 'new', 'new', 'new'],                })

and a right_df as:

right_df = pd.DataFrame({'col_0': ['0', '1', '2', '3', '4', '5'],'col_1': ['A', 'B', 'C', 'X', 'E', 'F'],'col_2': ['old', 'old', 'old', 'old', 'old', 'old'],'col_3': ['old', 'old', 'old', 'old', 'old', 'old'],'col_4': ['old', 'old', 'old', 'old', 'old', 'old'],'col_5': ['old', 'old', 'old', 'old', 'old', 'old'],'col_6': ['old', 'old', 'old', 'old', 'old', 'old'],'col_7': ['old', 'old', 'old', 'old', 'old', 'old'],'col_8': ['uid_0', 'uid_1', 'uid_2', 'uid_3', 'uid_4', 'uid_5'],'col_9': ['uid_a', 'uid_b', 'uid_c', 'uid_d', 'uid_e', 'uid_f'],                })

Where matching_cols = ['col_0', 'col_1']

I want to get the following result either as a new dataframe or in-place on right_df (note that col_1 doesn't match on row 3, so is not changed)

  col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7  col_8  col_9
0     0     A   new   new   new   new   new   new  uid_0  uid_a
1     1     B   new   new   new   new   new   new  uid_1  uid_b
2     2     C   new   new   new   new   new   new  uid_2  uid_c
3     3     X   old   old   old   old   old   old  uid_3  uid_d
4     4     E   new   new   new   new   new   new  uid_4  uid_e
5     5     F   new   new   new   new   new   new  uid_5  uid_f
Answer

Worked it out thanks to this post and the Pandas documentation:

First, it's a .merge I need, and I specify the suffixes as '_r' for only the columns to be copied from the right_df / for the old values I'm updating:

merged_df = pd.merge(left_df, right_df, on=['col_0', 'col_1'], suffixes=(None, '_r'))

This yields a new dataframe with rows containing both the new and old columns, only for rows in each dataframe where the values in columns on=['col_0', 'col_1'] are a match. Then I drop the "old" columns by using a regex filter on the text '_r':

merged_df.drop(list(merged_df.filter(regex = '_r')), axis=1, inplace=True)

This yields a dataframe with only the "modified" rows and no unmodified rows, which is close enough for what I need.

  col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7  col_8  col_9
0     0     A   new   new   new   new   new   new  uid_0  uid_a
1     1     B   new   new   new   new   new   new  uid_1  uid_b
2     2     C   new   new   new   new   new   new  uid_2  uid_c
3     4     E   new   new   new   new   new   new  uid_4  uid_e
4     5     F   new   new   new   new   new   new  uid_5  uid_f
https://en.xdnf.cn/q/119718.html

Related Q&A

How do I fix scrapy Unsupported URL scheme error?

I collect url from command python and then insert it into start_urls from flask import Flask, jsonify, request import scrapy import subprocessclass ClassSpider(scrapy.Spider):name = mySpider#sta…

comparing two timeseries dataframes based on some conditions in pandas

I have two timeseries dataframes df1 and df2: df1 = pd.DataFrame({date_1:[10/11/2017 0:00,10/11/2017 03:00,10/11/2017 06:00,10/11/2017 09:00],value_1:[5000,1500,np.nan,2000]})df1[date_1] = pd.to_dateti…

Game of Chance in Python 3.x?

I have this problem in my python code which is a coinflip game, the problem is that when It asks, "Heads or Tails?" and I just say 1 or Heads(same for 2 and Tails) without quotation marks an…

Count occurence of a word by ID in python

Following is the content of a file,My question is how to count the number of occurences for the word "optimus" for different IDs ID67 DATEUID Thank you for choosing Optimus prime. Please w…

ModuleNotFoundError: No module named plyer in Python

I am trying to write a program notify.py (location: desktop) that uses plyer library to get a notification on windows 10. I used pip install plyer and am using vs code to run the program but I get an e…

Floating point to 16 bit Twos Complement Binary, Python

so I think questions like this have been asked before but Im having quite a bit of trouble getting this implemented. Im dealing with CSV files that contain floating points between -1 and 1. All of thes…

Flag the first non zero column value with 1 and rest 0 having multiple columns

Please assist with the belowimport pandas as pd df = pd.DataFrame({Grp: [1,1,1,1,2,2,2,2,3,3,3,4,4,4], Org1: [x,x,y,y,z,y,z,z,x,y,y,z,x,x], Org2: [a,a,b,b,c,b,c,c,a,b,b,c,a,a], Value: [0,0,3,1,0,1,0,5,…

How to split up data from a column in a csv file into two separate output csv files?

I have a .csv file, e.g.:ID NAME CATEGORIES 1, x, AB 2, xx, AA 3, xxx, BAHow would I get this to form two output .csv files based on the category e.g.:File 1:ID NAME CATEGORY 1, x, A 2, xx, A 3, …

Discord.py spellcheck commands

Recently, I looked up Stack Overflow and found this code which can check for potential typos: from difflib import SequenceMatcher SequenceMatcher(None, "help", "hepl").ratio() # Ret…

Django Model Form doesnt seem to validate the BooleanField

In my model the validation is not validating for the boolean field, only one time product_field need to be checked , if two time checked raise validation error.product_field = models.BooleanField(defau…