Question 1

I have two example dataframes as follows:

import pandas as pd

df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'},
                    'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'},
                    'Age': {0: 27, 1: 23, 2: 21}})
df2 = pd.DataFrame({'Name': {0: 'John S.', 1: 'Bob K.', 2: 'Frank'},
                    'Degree': {0: 'Master', 1: 'Graduated', 2: 'Graduated'},
                    'GPA': {0: 3, 1: 3.5, 2: 4}})

I want to merge them on two columns, Name and Degree, using fuzzy matching to weed out possible duplicates. This is what I have so far, with help from this reference: Apply fuzzy matching across a dataframe column and save results in a new column

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

compare = pd.MultiIndex.from_product([df1['Name'], df2['Name']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup), fuzz.token_sort_ratio(*tup)], ['ratio', 'token'])

compare.apply(metrics)
compare.apply(metrics).unstack().idxmax().unstack(0)
compare.apply(metrics).unstack(0).idxmax().unstack(0)

Let's say that if fuzz.ratio is higher than 80 for both Name and Degree, we consider the two rows to be the same person, and we take Name and Degree from df1 as the default. How can I get the following expected result? Thanks.

df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')

     Name     Degree   Age  GPA duplicatedName duplicatedDegree
0    John    Masters  27.0  3.0        John S.           Master
1     Bob   Graduate  23.0  3.5         Bob K.        Graduated
2  Shiela   Graduate  21.0  NaN            NaN        Graduated
3   Frank  Graduated   NaN  4.0            NaN         Graduate
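
A quick way to see how the sample names score against the proposed cut-off of 80 is sketched below. It uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio. That is an assumption made so the snippet runs without fuzzywuzzy, and the scores may differ slightly from the library's, for instance when python-Levenshtein is installed. The helper name ratio is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Name columns copied from df1 and df2 in the question
names1 = ['John', 'Bob', 'Shiela']
names2 = ['John S.', 'Bob K.', 'Frank']

# Score every (df1 name, df2 name) pair
scores = {(a, b): ratio(a, b) for a, b in product(names1, names2)}

# Pairs that would pass the proposed cut-off of 80
above_80 = {pair: s for pair, s in scores.items() if s > 80}

for pair, s in sorted(scores.items(), key=lambda kv: -kv[1])[:2]:
    print(pair, s)
print(above_80)
```

With this stand-in, John / John S. scores 73 and Bob / Bob K. scores 67, so no name pair clears 80 on the sample data.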

Question 2

I think the ratio threshold should be lower; 60 works for me. Create a Series with a dict comprehension, filter by N, and keep the maximal value per group. Finally, map with fillna and merge:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product

N = 60

# score every (df1 name, df2 name) pair
names = {tup: fuzz.ratio(*tup) for tup in product(df1['Name'].tolist(), df2['Name'].tolist())}
s1 = pd.Series(names)
s1 = s1[s1 > N]
# keep the best-scoring pair for each df1 name
s1 = s1[s1.groupby(level=0).idxmax()]
# re-index by the df2 name so the Series maps df2 name -> df1 name
s1 = pd.Series(s1.index.get_level_values(0), index=s1.index.get_level_values(1))
print (s1)
Bob K.      Bob
John S.    John
dtype: object

# same steps for the Degree column
degrees = {tup: fuzz.ratio(*tup) for tup in product(df1['Degree'].tolist(), df2['Degree'].tolist())}
s2 = pd.Series(degrees)
s2 = s2[s2 > N]
s2 = s2[s2.groupby(level=0).idxmax()]
s2 = pd.Series(s2.index.get_level_values(0), index=s2.index.get_level_values(1))
print (s2)
Graduated    Graduate
Master        Masters
dtype: object

# replace matched df2 values with their df1 spelling, keep the rest unchanged
df2['Name'] = df2['Name'].map(s1).fillna(df2['Name'])
df2['Degree'] = df2['Degree'].map(s2).fillna(df2['Degree'])
# generally slower alternative
#df2['Name'] = df2['Name'].replace(s1)
#df2['Degree'] = df2['Degree'].replace(s2)

df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
print (df)
     Name    Degree   Age  GPA
0    John   Masters  27.0  3.0
1     Bob  Graduate  23.0  3.5
2  Shiela  Graduate  21.0  NaN
3   Frank  Graduate   NaN  4.0
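
The approach above matches Name and Degree independently of each other. If the rule from the question is wanted instead (a pair of rows counts as the same person only when both columns pass the threshold), it could be sketched roughly as follows. This is a minimal pure-Python illustration, not a pandas implementation: difflib.SequenceMatcher stands in for fuzz.ratio (an assumption, scores may differ slightly from fuzzywuzzy), and the names ratio and match_rows are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def match_rows(left, right, keys, threshold):
    """Map each right row index to the best left row index.

    A pair qualifies only if every key column scores above the threshold;
    among qualifying pairs, the one with the highest summed score wins.
    """
    pairs = {}
    for j, r in enumerate(right):
        best = None  # (summed score, left index)
        for i, l in enumerate(left):
            scores = [ratio(l[k], r[k]) for k in keys]
            if min(scores) > threshold:
                total = sum(scores)
                if best is None or total > best[0]:
                    best = (total, i)
        if best is not None:
            pairs[j] = best[1]
    return pairs


# Rows copied from df1 and df2 in the question (key columns only)
left = [{'Name': 'John', 'Degree': 'Masters'},
        {'Name': 'Bob', 'Degree': 'Graduate'},
        {'Name': 'Shiela', 'Degree': 'Graduate'}]
right = [{'Name': 'John S.', 'Degree': 'Master'},
         {'Name': 'Bob K.', 'Degree': 'Graduated'},
         {'Name': 'Frank', 'Degree': 'Graduated'}]

pairs = match_rows(left, right, ['Name', 'Degree'], threshold=60)
print(pairs)
```

Under this rule Frank stays unmatched, so his Degree would not be rewritten to Graduate as it is in the column-by-column version.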

Merge dataframes on multiple columns with fuzzy match in Python
