Merge dataframes on multiple columns with fuzzy match in Python

2024/9/22 5:29:41

I have two example dataframes as follows:

import pandas as pd

df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'}, 'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'}, 'Age': {0: 27, 1: 23, 2: 21}})
df2 = pd.DataFrame({'Name': {0: 'John S.', 1: 'Bob K.', 2: 'Frank'}, 'Degree': {0: 'Master', 1: 'Graduated', 2: 'Graduated'}, 'GPA': {0: 3, 1: 3.5, 2: 4}})

I want to merge them on the two columns Name and Degree, using fuzzy matching to weed out possible duplicates. This is what I have so far, based on this reference: Apply fuzzy matching across a dataframe column and save results in a new column

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

compare = pd.MultiIndex.from_product([df1['Name'], df2['Name']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup), fuzz.token_sort_ratio(*tup)], ['ratio', 'token'])

compare.apply(metrics)
compare.apply(metrics).unstack().idxmax().unstack(0)
compare.apply(metrics).unstack(0).idxmax().unstack(0)

Let's say that if fuzz.ratio is higher than 80 for both Name and Degree, we consider the two rows to be the same person, and take Name and Degree from df1 by default. How can I get the following expected result? Thanks.

df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')

     Name     Degree   Age  GPA duplicatedName duplicatedDegree
0    John    Masters  27.0  3.0        John S.           Master
1     Bob   Graduate  23.0  3.5         Bob K.        Graduated
2  Shiela   Graduate  21.0  NaN            NaN        Graduated
3   Frank  Graduated   NaN  4.0            NaN         Graduate
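
As a quick sanity check of the 80 cutoff, the sketch below scores the sample pairs using the standard library's difflib.SequenceMatcher. This is an assumed stand-in for fuzz.ratio (fuzzywuzzy uses SequenceMatcher when python-Levenshtein is not installed), so the scores may differ slightly from a real fuzzywuzzy run. The helper name `ratio` is hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity score from 0 to 100 (hypothetical stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Pairs taken from the sample dataframes: (value in df1, value in df2)
pairs = [
    ("John", "John S."),
    ("Bob", "Bob K."),
    ("Masters", "Master"),
    ("Graduate", "Graduated"),
]

scores = {pair: ratio(*pair) for pair in pairs}
for pair, score in scores.items():
    # Report whether each pair would pass the proposed cutoff of 80
    print(pair, score, "match" if score > 80 else "no match")
```

With this stand-in scoring, the degree pairs score above 80 but the name pairs do not, so a single cutoff of 80 on both columns would probably not pair up the sample rows.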
Answer

I think the threshold should be lower; 60 works for me. Create a Series from a dict comprehension, filter by N, and keep the maximal value per group. Finally, map with fillna and then merge:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product

N = 60

# score every (df1 name, df2 name) pair
names = {tup: fuzz.ratio(*tup) for tup in product(df1['Name'].tolist(), df2['Name'].tolist())}
s1 = pd.Series(names)
s1 = s1[s1 > N]
s1 = s1[s1.groupby(level=0).idxmax()]
# index by the df2 value and keep the df1 value, so that map() below can look it up
s1 = pd.Series(s1.index.get_level_values(0), index=s1.index.get_level_values(1))
print (s1)
John S.    John
Bob K.      Bob
dtype: object

# same steps for the Degree column
degrees = {tup: fuzz.ratio(*tup) for tup in product(df1['Degree'].tolist(), df2['Degree'].tolist())}
s2 = pd.Series(degrees)
s2 = s2[s2 > N]
s2 = s2[s2.groupby(level=0).idxmax()]
s2 = pd.Series(s2.index.get_level_values(0), index=s2.index.get_level_values(1))
print (s2)
Graduated    Graduate
Master        Masters
dtype: object

# replace matched df2 values with their df1 counterparts, keep the rest unchanged
df2['Name'] = df2['Name'].map(s1).fillna(df2['Name'])
df2['Degree'] = df2['Degree'].map(s2).fillna(df2['Degree'])
# generally slower alternative
#df2['Name'] = df2['Name'].replace(s1)
#df2['Degree'] = df2['Degree'].replace(s2)

df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
print (df)
     Name    Degree   Age  GPA
0    John   Masters  27.0  3.0
1     Bob  Graduate  23.0  3.5
2  Shiela  Graduate  21.0  NaN
3   Frank  Graduate   NaN  4.0
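
The mapping step can also be sketched without pandas. The version below is a minimal illustration, not the code from the answer: it uses difflib.SequenceMatcher as an assumed stand-in for fuzz.ratio, and the helper names `ratio` and `best_matches` are hypothetical. It also differs from the groupby approach in that it picks the best df1 value for each df2 value, rather than the best df2 value for each df1 value.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Hypothetical stand-in for fuzz.ratio: similarity score from 0 to 100
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def best_matches(canonical, candidates, threshold=60):
    """Map each candidate to its best-scoring canonical value.

    Candidates whose best score does not exceed the threshold are left out,
    so they keep their original value when the mapping is applied.
    """
    mapping = {}
    for cand in sorted(set(candidates)):
        best = max(canonical, key=lambda c: ratio(c, cand))
        if ratio(best, cand) > threshold:
            mapping[cand] = best
    return mapping


# Column values copied from the sample dataframes
df1_names = ["John", "Bob", "Shiela"]
df2_names = ["John S.", "Bob K.", "Frank"]
df1_degrees = ["Masters", "Graduate", "Graduate"]
df2_degrees = ["Master", "Graduated", "Graduated"]

name_map = best_matches(df1_names, df2_names)
degree_map = best_matches(df1_degrees, df2_degrees)
print(name_map)
print(degree_map)

# Apply the mappings; unmatched values fall back to themselves
normalized_names = [name_map.get(n, n) for n in df2_names]
normalized_degrees = [degree_map.get(d, d) for d in df2_degrees]
print(normalized_names)
print(normalized_degrees)
```

A dict such as name_map can be passed to df2['Name'].map(name_map).fillna(df2['Name']) in the same way as the Series above. Note that each column is matched independently, so nothing here enforces that Name and Degree both match for the same pair of rows.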