I have two pandas dataframes that I want to join/merge exactly on a number of columns (say 3) and approximately, i.e. nearest-neighbour, on one (date) column. I also want to return the difference (in days) between them. Each dataset is about 50,000 rows long. I'm most interested in an inner join, but the “leftovers” would also be interesting if they're not too hard to get hold of. Most of the “exact match” observations will exist multiple times in each dataframe.
I've been trying to use difflib.get_close_matches on the concatenation of all of them as strings (which is stupid, I know!), but it does not always give exact matches. I suppose I need to loop over the exact matches first and then find the nearest match within each group, but I just can't seem to get it right...
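Roughly what I have been trying, as a simplified sketch (my real code is messier):

import difflib

# glue every column, including the date, into one string per row and
# fuzzy-match those; this mixes the columns that should match exactly
# with the date, which is why it sometimes returns non-exact matches
keys1 = df1['col1'] + '_' + df1['col2'] + '_' + df1['col3'] + '_' + df1['date']
keys2 = df2['col1'] + '_' + df2['col2'] + '_' + df2['col3'] + '_' + df2['date']

for idx, k in keys1.items():
    print(idx, difflib.get_close_matches(k, keys2.tolist(), n=1, cutoff=0))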
The dataframes look something like:
df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'],
                    'col1': ['1232','432','432','123'],
                    'col2': ['asd','dsa12','dsa12','asd2'],
                    'col3': ['1','2','2','3'],
                    'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21']}).set_index('index')

df1
Out[430]: 
       col1   col2 col3        date
index
a1     1232    asd    1  2010-01-23
a2      432  dsa12    2  2016-05-20
a3      432  dsa12    2  2010-06-20
a4      123   asd2    3  2008-10-21

df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'],
                    'col1': ['132','432','432','123'],
                    'col2': ['asd','dsa12','dsa12','sd2'],
                    'col3': ['1','2','2','3'],
                    'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21']}).set_index('index')

df2
Out[434]: 
      col1   col2 col3        date
index
b1     132    asd    1  2010-01-23
b2     432  dsa12    2  2016-05-23
b3     432  dsa12    2  2010-06-10
b4     123    sd2    3  2008-10-21
In the end I want something like:
       col1   col2 col3        date  diff match_index
index
a1     1232    asd    1  2010-01-23   nan          nan
a2      432  dsa12    2  2016-05-20    -3           b2
a3      432  dsa12    2  2010-06-20    10           b3
a4      123   asd2    3  2008-10-21   nan          nan
a5      123    sd2    3  2008-10-21   nan           b4
or if it's way easier with just an inner join I'd like:
       col1   col2 col3        date  diff match_index
index
a2      432  dsa12    2  2016-05-20    -3           b2
a3      432  dsa12    2  2010-06-20    10           b3
I am not sure if this suits you. It achieves more or less what you want, but it does not actually perform a merge. It follows the same idea as this question, except that instead of subsetting df1 based on only one column, here we match on multiple columns using a groupby, and we do it on both dataframes. If you do want to explicitly include the merge command and are happy with an inner join, check the very bottom of the answer; it includes a snippet for that.
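To make the idea concrete before the full code: for one group of rows that agree on all three key columns, "nearest" simply means the smallest absolute date gap to any row of the other dataframe in the same group. A rough illustration of that single step, hard-coding one group key from your example (this is not part of the actual solution below):

import pandas as pd

keys = ['col1', 'col2', 'col3']
group_key = ('432', 'dsa12', '2')

# the rows of each frame that agree exactly on col1, col2 and col3
g1 = df1.groupby(keys).get_group(group_key)
g2 = df2.groupby(keys).get_group(group_key)

for idx, row in g1.iterrows():
    # absolute date gap to every candidate in the same exact-match group
    gaps = (pd.to_datetime(g2['date']) - pd.to_datetime(row['date'])).abs()
    print(idx, '->', gaps.idxmin(), gaps.min())

The full version of this, applied in both directions and using sklearn's NearestNeighbors instead of a loop, is below: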
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def find_nearest(group, other, keys):
    """For one exact-match group, attach the nearest date found in the
    matching group of the other dataframe, plus its index label."""
    try:
        match = other.groupby(keys).get_group(group.name).copy()
    except KeyError:
        # no rows in the other frame share this (col1, col2, col3) combination
        return group
    # sklearn needs numeric input, so feed it the datetimes as int64 nanoseconds
    nbrs = NearestNeighbors(n_neighbors=1).fit(match['date'].values.astype('int64')[:, None])
    dist, ind = nbrs.kneighbors(group['date'].values.astype('int64')[:, None])
    group['date1'] = group['date']                       # keep the original date
    group['date'] = match['date'].values[ind.ravel()]    # nearest date in the other frame
    group['diff'] = group['date1'] - group['date']       # signed gap
    group['match_index'] = match.index[ind.ravel()]      # index label of the match
    return group

# change dates from string to datetime
df1['date'] = pd.to_datetime(df1.date)
df2['date'] = pd.to_datetime(df2.date)

# find closest dates and differences, in both directions
keys = ['col1', 'col2', 'col3']
df1_mod = df1.groupby(keys).apply(find_nearest, df2, keys)
df2_mod = df2.groupby(keys).apply(find_nearest, df1, keys)

# fill unmatched dates
df1_mod['date1'] = df1_mod['date1'].fillna(df1_mod['date'])
df2_mod['date1'] = df2_mod['date1'].fillna(df2_mod['date'])

# drop the replaced date column and restore the original under its old name
df1_mod.drop('date', inplace=True, axis=1)
df1_mod.rename(columns={'date1': 'date'}, inplace=True)
df2_mod.drop('date', inplace=True, axis=1)
df2_mod.rename(columns={'date1': 'date'}, inplace=True)

# the gap was computed from df2's point of view, so flip its sign
df2_mod['diff'] = -df2_mod['diff']

# drop df2 rows that found a match (they are already represented in df1_mod)
df2_mod.drop(df2_mod[df2_mod.match_index.notna()].index, inplace=True)

# combine the matched df1 rows with the unmatched leftovers from df2
df_final = pd.merge(df1_mod, df2_mod, how='outer')
This yields the following result:
In [349]: df_final
Out[349]: 
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1   432  dsa12    2 2016-05-20 -3 days          b2
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
With the merge command:
In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]: 
  col1   col2 col3       date    diff match_index
0  432  dsa12    2 2016-05-20 -3 days          b2
2  432  dsa12    2 2010-06-20 10 days          b3
The case considered in the comments, namely:
df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'],
                    'col1': ['1232','1432','432','123'],
                    'col2': ['asd','dsa12','dsa12','asd2'],
                    'col3': ['1','2','2','3'],
                    'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21']}).set_index('index')
yields the following:
In [351]: df_final
Out[351]: 
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1  1432  dsa12    2 2016-05-20     NaT         NaN
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
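As an aside, and not part of the original answer: if your pandas version has pd.merge_asof with direction='nearest' (0.20 or later), the df1 side of the result can be sketched much more compactly. This is only a sketch under those assumptions: it presumes the date columns have already been converted to datetime as above, and it keeps every df1 row, filling NaT/NaN where no exact key match exists in df2:

import pandas as pd

# expose df2's index as a match_index column and keep its date separately,
# then merge exactly on the key columns and nearest on the date
right = (df2.reset_index()
            .rename(columns={'index': 'match_index'})
            .assign(date_right=lambda d: d['date'])
            .sort_values('date'))

out = pd.merge_asof(df1.reset_index().sort_values('date'), right,
                    on='date', by=['col1', 'col2', 'col3'],
                    direction='nearest')
out['diff'] = out['date'] - out['date_right']

The leftovers from df2 would still have to be appended separately, as in the groupby approach above.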