I have two example dataframes as follows:
df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'}, 'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'}, 'Age': {0: 27, 1: 23, 2: 21}}) df2 = pd.DataFrame({'Name': {0: 'John S.', 1: 'Bob K.', 2: 'Frank'}, 'Degree': {0: 'Master', 1: 'Graduated', 2: 'Graduated'}, 'GPA': {0: 3, 1: 3.5, 2: 4}})
I want to merge them together based on two columns Name
and Degree
with fuzzy matching method to drive out possible duplicates. This is what I have realized with the help from reference here:
Apply fuzzy matching across a dataframe column and save results in a new column
from fuzzywuzzy import fuzz
from fuzzywuzzy import processcompare = pd.MultiIndex.from_product([df1['Name'],df2['Name']]).to_series()def metrics(tup):return pd.Series([fuzz.ratio(*tup),fuzz.token_sort_ratio(*tup)],['ratio', 'token'])
compare.apply(metrics)compare.apply(metrics).unstack().idxmax().unstack(0)compare.apply(metrics).unstack(0).idxmax().unstack(0)
Let's say fuzz.ratio of one's Name
and Degree
both are higher than 80 we consider they are same person. And taken Name
and Degree
from df1 as default. How can I get a following expected result? Thanks.
df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')Name Degree Age GPA duplicatedName duplicatedDegree
0 John Masters 27.0 3.0 John S. Master
1 Bob Graduate 23.0 3.5 Bob K. Graduated
2 Shiela Graduate 21.0 NaN NaN Graduated
3 Frank Graduated NaN 4.0 NaN Graduate