Question 1

I have two pandas dataframes I want to join/merge exactly on a number of columns (say 3) and approximately, i.e nearest neighbour, on one (date) column. I also want to return the difference (days) between them. Each dataset is about 50,000 rows long. I'm most interested in an inner join, but the “leftovers” are also interesting if not too hard to get hold of. Most of the “exact match” observations will exist multiple times in each data frame.

I've been trying to use difflib.get_close_matches on the concatenation of all of them as strings (which is stupid, I know!) but is does not always give exact matches. I suppose I need to loop over the exact matches first and then find nearest matches within this group, but I just can't seem to get it right...

The dataframes look something like:

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')df1
Out[430]: col1   col2 col3        date
index                              
a1     1232    asd    1  2010-01-23
a2      432  dsa12    2  2016-05-20
a3      432  dsa12    2  2010-06-20
a4      123   asd2    3  2008-10-21df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'], 'col1': ['132','432','432','123'], 'col2': ['asd','dsa12','dsa12','sd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21'],}).set_index('index')df2
Out[434]: col1   col2 col3        date
index                             
b1     132    asd    1  2010-01-23
b2     432  dsa12    2  2016-05-23
b3     432  dsa12    2  2010-06-10
b4     123    sd2    3  2008-10-21

In the end I want something like:

       col1   col2 col3        date diff match_index
index                              
a1     1232    asd    1  2010-01-23  nan         nan
a2      432  dsa12    2  2016-05-20   -3          b2
a3      432  dsa12    2  2010-06-20   10          b3
a4      123   asd2    3  2008-10-21  nan         nan
a5      123    sd2    3  2008-10-21  nan          b4

or if it's way easier with just an inner join I'd like:

       col1   col2 col3        date diff match_index
index                                                     
a2      432  dsa12    2  2016-05-20   -3          b2
a3      432  dsa12    2  2010-06-20   10          b3

Question 2

I am not sure if this suits. It achieves more or less what you want but does not actually perform a merge. It follows the same idea as this question except instead of subsetting the df1 based on only one column, here we match on multiple columns using a groupby and do it on both dataframes. IF you do want to explicitly include the merge command and are happy with an inner join then check the very bottom of the answer, it includes a snippet for that.

import pandas as pd
from sklearn.neighbors import NearestNeighborsdef find_nearest(group, df2, groupname):try:match = df2.groupby(groupname).get_group(group.name)match['date'] = pd.to_datetime(match.date, unit = 'D')nbrs = NearestNeighbors(1).fit(match['date'].values[:, None])dist, ind = nbrs.kneighbors(group['date'].values[:, None])group['date1'] = group['date']group['date'] = match['date'].values[ind.ravel()]group['diff'] = (group['date1']-group['date'])group['match_index'] = match.index[ind.ravel()]return groupexcept KeyError:return group#change dates from string to datetimedf1['date'] = pd.to_datetime(df1.date, unit = 'D')df2['date'] = pd.to_datetime(df2.date, unit = 'D')#find closest dates and differenceskeys = ['col1', 'col2', 'col3']df1_mod = df1.groupby(keys).apply(find_nearest, df2, keys)#fill unmatched dates df1_mod.date1.fillna(df1_mod.date, inplace=True)df2_mod = df2.groupby(keys).apply(find_nearest, df1, keys) df2_mod.date1.fillna(df2_mod.date, inplace=True)#drop original column df1_mod.drop('date', inplace=True, axis=1)df1_mod.rename(columns = {'date1':'date'}, inplace=True)df2_mod.drop('date', inplace=True, axis=1)df2_mod.rename(columns = {'date1':'date'}, inplace=True)df2_mod['diff'] = -df2_mod['diff']#drop redundant valuesdf2_mod.drop(df2_mod[df2_mod.match_index.str.len()>0].index, inplace=True)#merge the two df_final = pd.merge(df1_mod, df2_mod, how='outer')

This yields the following result:

In [349]: df_final
Out[349]:col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1   432  dsa12    2 2016-05-20 -3 days          b2
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN

With the merge command:

In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]:col1   col2 col3       date    diff match_index
0  432  dsa12    2 2016-05-20 -3 days          b2
2  432  dsa12    2 2010-06-20 10 days          b3

The case considered in the comments, namely:

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','1432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')

yields the following:

In [351]: df_final
Out[351]:col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1  1432  dsa12    2 2016-05-20     NaT         NaN
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN

Pandas: Approximate join on one column, exact match on other columns

Related Q&A

Adding a variable in Content disposition response file name-python/django

TkInter: understanding unbind function

Dynamically get dict elements via getattr?

How do I copy an image from the output in Jupyter Notebook 7+?

How to join 2 dataframe on year and month in Pandas?

Sorting Python Dictionary based on Key? [duplicate]

Flask: Template in Blueprint Inherit from Template in App?

Equivalent of python2 chr(int) in python3

How To Pagination Angular2 with Django Rest Framework API

Color percentage in image for Python using OpenCV