I have 2 dataframes left_df
and right-df
, which both have 20 columns with identical names and dtypes. right_df
also has 2 additional columns with unique values on every row.
I want to update rows in right_df
with ALL the values from left_df
where the values in ALL columns in a list of a subset of columns, matching_cols = ['col_1', 'col_3', 'col_10', 'col_12']
are identical in both dataframes. The values in the additional 2 unique columns in right_df
should be preserved.
Ideally, I want to also drop those rows from left_df
in the same command, or as the next command if this isn't possible. I need to do this process more than once, matching on several different lists of columns, with the left_df
dropping matched rows each loop, until eventually no further matches are found.
An acceptable alternative would be any method to create a new dataframe new_df
containing the set of rows where all specified columns in the list matching_cols
match, with values from left_df
in the first 20 columns and values from right_df
in the remaining 2 columns.
I don't care about preserving the indices at any point in either dataframe, I am importing them to SQL after this and will reindex them on one of the 2 right_df
values at the end.
New to Pandas and can't determine what method to use, have tried variations of .merge
, .join
, .update
, etc, but can't seem to specify to only update when my desired column values all match, or how to drop those rows/export them to a new df.
Update: Added pseudocode below:
For a left_df
as:
left_df = pd.DataFrame({'col_0': ['0', '1', '2', '3', '4', '5'],'col_1': ['A', 'B', 'C', 'D', 'E', 'F'],'col_2': ['new', 'new', 'new', 'new', 'new', 'new'],'col_3': ['new', 'new', 'new', 'new', 'new', 'new'],'col_4': ['new', 'new', 'new', 'new', 'new', 'new'],'col_5': ['new', 'new', 'new', 'new', 'new', 'new'],'col_6': ['new', 'new', 'new', 'new', 'new', 'new'],'col_7': ['new', 'new', 'new', 'new', 'new', 'new'], })
and a right_df as:
right_df = pd.DataFrame({'col_0': ['0', '1', '2', '3', '4', '5'],'col_1': ['A', 'B', 'C', 'X', 'E', 'F'],'col_2': ['old', 'old', 'old', 'old', 'old', 'old'],'col_3': ['old', 'old', 'old', 'old', 'old', 'old'],'col_4': ['old', 'old', 'old', 'old', 'old', 'old'],'col_5': ['old', 'old', 'old', 'old', 'old', 'old'],'col_6': ['old', 'old', 'old', 'old', 'old', 'old'],'col_7': ['old', 'old', 'old', 'old', 'old', 'old'],'col_8': ['uid_0', 'uid_1', 'uid_2', 'uid_3', 'uid_4', 'uid_5'],'col_9': ['uid_a', 'uid_b', 'uid_c', 'uid_d', 'uid_e', 'uid_f'], })
Where matching_cols = ['col_0', 'col_1']
I want to get the following result either as a new dataframe or in-place on right_df
(note that col_1
doesn't match on row 3, so is not changed)
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0 A new new new new new new uid_0 uid_a
1 1 B new new new new new new uid_1 uid_b
2 2 C new new new new new new uid_2 uid_c
3 3 X old old old old old old uid_3 uid_d
4 4 E new new new new new new uid_4 uid_e
5 5 F new new new new new new uid_5 uid_f