Pandas: Approximate join on one column, exact match on other columns

2024/11/15 13:58:37

I have two pandas dataframes I want to join/merge exactly on a number of columns (say 3) and approximately, i.e nearest neighbour, on one (date) column. I also want to return the difference (days) between them. Each dataset is about 50,000 rows long. I'm most interested in an inner join, but the “leftovers” are also interesting if not too hard to get hold of. Most of the “exact match” observations will exist multiple times in each data frame.

I've been trying to use difflib.get_close_matches on the concatenation of all of them as strings (which is stupid, I know!) but is does not always give exact matches. I suppose I need to loop over the exact matches first and then find nearest matches within this group, but I just can't seem to get it right...

The dataframes look something like:

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')df1
Out[430]: col1   col2 col3        date
index                              
a1     1232    asd    1  2010-01-23
a2      432  dsa12    2  2016-05-20
a3      432  dsa12    2  2010-06-20
a4      123   asd2    3  2008-10-21df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'], 'col1': ['132','432','432','123'], 'col2': ['asd','dsa12','dsa12','sd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21'],}).set_index('index')df2
Out[434]: col1   col2 col3        date
index                             
b1     132    asd    1  2010-01-23
b2     432  dsa12    2  2016-05-23
b3     432  dsa12    2  2010-06-10
b4     123    sd2    3  2008-10-21

In the end I want something like:

       col1   col2 col3        date diff match_index
index                              
a1     1232    asd    1  2010-01-23  nan         nan
a2      432  dsa12    2  2016-05-20   -3          b2
a3      432  dsa12    2  2010-06-20   10          b3
a4      123   asd2    3  2008-10-21  nan         nan
a5      123    sd2    3  2008-10-21  nan          b4

or if it's way easier with just an inner join I'd like:

       col1   col2 col3        date diff match_index
index                                                     
a2      432  dsa12    2  2016-05-20   -3          b2
a3      432  dsa12    2  2010-06-20   10          b3
Answer

I am not sure if this suits. It achieves more or less what you want but does not actually perform a merge. It follows the same idea as this question except instead of subsetting the df1 based on only one column, here we match on multiple columns using a groupby and do it on both dataframes. IF you do want to explicitly include the merge command and are happy with an inner join then check the very bottom of the answer, it includes a snippet for that.

import pandas as pd
from sklearn.neighbors import NearestNeighborsdef find_nearest(group, df2, groupname):try:match = df2.groupby(groupname).get_group(group.name)match['date'] = pd.to_datetime(match.date, unit = 'D')nbrs = NearestNeighbors(1).fit(match['date'].values[:, None])dist, ind = nbrs.kneighbors(group['date'].values[:, None])group['date1'] = group['date']group['date'] = match['date'].values[ind.ravel()]group['diff'] = (group['date1']-group['date'])group['match_index'] = match.index[ind.ravel()]return groupexcept KeyError:return group#change dates from string to datetimedf1['date'] = pd.to_datetime(df1.date, unit = 'D')df2['date'] = pd.to_datetime(df2.date, unit = 'D')#find closest dates and differenceskeys = ['col1', 'col2', 'col3']df1_mod = df1.groupby(keys).apply(find_nearest, df2, keys)#fill unmatched dates df1_mod.date1.fillna(df1_mod.date, inplace=True)df2_mod = df2.groupby(keys).apply(find_nearest, df1, keys) df2_mod.date1.fillna(df2_mod.date, inplace=True)#drop original column df1_mod.drop('date', inplace=True, axis=1)df1_mod.rename(columns = {'date1':'date'}, inplace=True)df2_mod.drop('date', inplace=True, axis=1)df2_mod.rename(columns = {'date1':'date'}, inplace=True)df2_mod['diff'] = -df2_mod['diff']#drop redundant valuesdf2_mod.drop(df2_mod[df2_mod.match_index.str.len()>0].index, inplace=True)#merge the two df_final = pd.merge(df1_mod, df2_mod, how='outer')

This yields the following result:

In [349]: df_final
Out[349]:col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1   432  dsa12    2 2016-05-20 -3 days          b2
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN

With the merge command:

In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]:col1   col2 col3       date    diff match_index
0  432  dsa12    2 2016-05-20 -3 days          b2
2  432  dsa12    2 2010-06-20 10 days          b3

The case considered in the comments, namely:

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','1432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')

yields the following:

In [351]: df_final
Out[351]:col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1  1432  dsa12    2 2016-05-20     NaT         NaN
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
https://en.xdnf.cn/q/71456.html

Related Q&A

Adding a variable in Content disposition response file name-python/django

I am looking to add a a variable into the file name section of my below python code so that the downloaded files name will change based on a users input upon download. So instead of "Data.xlsx&quo…

TkInter: understanding unbind function

Does TkInter unbind function prevents the widget on which it is applied from binding further events to the widget ?Clarification:Lets say I bound events to a canvas earlier in a prgram:canvas.bind(&qu…

Dynamically get dict elements via getattr?

I want to dynamically query which objects from a class I would like to retrieve. getattr seems like what I want, and it performs fine for top-level objects in the class. However, Id like to also specif…

How do I copy an image from the output in Jupyter Notebook 7+?

Ive been working with Jupyter Notebooks for quite a while. When working with visualisations, I like to copy the output image from a cell by right clicking the image and selecting "Copy Image"…

How to join 2 dataframe on year and month in Pandas?

I have 2 dataframe and I want to join them on the basis of month and year from a date without creating extra columns:example :df1 :date_1 value_1 2017-1-15 20 2017-1-31 30 2016-2-15 20df2…

Sorting Python Dictionary based on Key? [duplicate]

This question already has answers here:How do I sort a dictionary by key?(33 answers)Closed 10 years ago.I have created a python dictionary which has keys in this form :11, 10, 00, 01, 20, 21, 31, 30T…

Flask: Template in Blueprint Inherit from Template in App?

Im a total Flask/Jinja2 newbie, so maybe Im overlooking something obvious, but:Shouldnt Flask, out of the box, allow a template that exists in a blueprints templates/ folder to extend a base template d…

Equivalent of python2 chr(int) in python3

# python2 print(chr(174)) ?# python3 print(chr(174)) Im looking for the equivalent of chr() from python2. I believe this is due to python 3 returning unicode characters rather than ASCII.

How To Pagination Angular2 with Django Rest Framework API

I am trying to create a simple blog application using Angular2 with Django Rest Framework. I am implementing pagination in Django, but I do not know how to rendering it in Angular.API has the following…

Color percentage in image for Python using OpenCV

Im creating a code which can detect the percentage of green colour from an image. . I have a little experience with OpenCV but am still pretty new to image processing and would like some help with my c…