Full outer join of two or more data frames

2024/10/9 14:19:33

Given the following three Pandas data frames, I need to merge them similar to an SQL full outer join. Note that the key is multi-index type_N and id_N with N = 1,2,3:

import pandas as pdraw_data = {'type_1': [0, 1, 1,1],'id_1': ['3', '4', '5','5'],'name_1': ['Alex', 'Amy', 'Allen', 'Jane']}
df_a = pd.DataFrame(raw_data, columns = ['type_1', 'id_1', 'name_1' ])raw_datab = {'type_2': [1, 1, 1, 0],'id_2': ['4', '5', '5', '7'],'name_2': ['Bill', 'Brian', 'Joe', 'Bryce']}
df_b = pd.DataFrame(raw_datab, columns = ['type_2', 'id_2', 'name_2'])raw_datac = {'type_3': [1, 0],'id_3': ['4', '7'],'name_3': ['School', 'White']}
df_c = pd.DataFrame(raw_datac, columns = ['type_3', 'id_3', 'name_3'])

The expected result should be:

type_1   id_1   name_1   type_2   id_2   name_2   type_3   id_3   name_3
0        3      Alex     NaN      NaN    NaN      NaN      NaN    NaN
1        4      Amy      1        4      Bill     1        4      School
1        5      Allen    1        5      Brian    NaN      NaN    NaN
1        5      Allen    1        5      Joe      NaN      NaN    NaN
1        5      Jane     1        5      Brian    NaN      NaN    NaN
1        5      Jane     1        5      Joe      NaN      NaN    NaN
NaN      NaN    NaN      0        7      Bryce    0        7      White

How can this be achieved in Pandas?

Answer

I'll propose that you make life less complicated and not have different names for the things you want to merge on.

da = df_a.set_index(['type_1', 'id_1']).rename_axis(['type', 'id'])
db = df_b.set_index(['type_2', 'id_2']).rename_axis(['type', 'id'])
dc = df_c.set_index(['type_3', 'id_3']).rename_axis(['type', 'id'])da.join(db, how='outer').join(dc, how='outer')name_1 name_2  name_3
type id                      
0    3    Alex    NaN     NaN7     NaN  Bryce   White
1    4     Amy   Bill  School5   Allen  Brian     NaN5   Allen    Joe     NaN5    Jane  Brian     NaN5    Jane    Joe     NaN

Here's an obnoxious way to get those other columns

from cytoolz.dicttoolz import mergei = pd.DataFrame(d.index.values.tolist(), d.index, d.index.names)
d = d.assign(**merge(i.mask(d[f'name_{j}'].isna()).add_suffix(f'_{j}').to_dict('l')for j in [1, 2, 3]
))d[sorted(d.columns, key=lambda x: x.split('_')[::-1])]id_1 name_1  type_1 id_2 name_2  type_2 id_3  name_3  type_3
type id                                                             
0    3     3   Alex     0.0  NaN    NaN     NaN  NaN     NaN     NaN7   NaN    NaN     NaN    7  Bryce     0.0    7   White     0.0
1    4     4    Amy     1.0    4   Bill     1.0    4  School     1.05     5  Allen     1.0    5  Brian     1.0  NaN     NaN     NaN5     5  Allen     1.0    5    Joe     1.0  NaN     NaN     NaN5     5   Jane     1.0    5  Brian     1.0  NaN     NaN     NaN5     5   Jane     1.0    5    Joe     1.0  NaN     NaN     NaN
https://en.xdnf.cn/q/70008.html

Related Q&A

How can I add a level to a MultiIndex?

index = [np.array([foo, foo, qux]),np.array([a, b, a])] data = np.random.randn(3, 2) columns = ["X", "Y"] df = pd.DataFrame(data, index=index, columns=columns) df.index.names = [&qu…

decoupled frontend and backend with Django, webpack, reactjs, react-router

I am trying to decouple my frontend and my backend in my project. My frontend is made up of reactjs and routing will be done with react-router, My backend if made form Django and I plan to use the fron…

Map colors in image to closest member of a list of colors, in Python

I have a list of 19 colors, which is a numpy array of size (19,3):colors = np.array([[0, 0, 0], [0, 0, 255], [255, 0, 0], [150, 30, 150], [255, 65, 255], [150, 80, 0], [170, 120, 65], [125, 125,…

Storing a file in the clipboard in python

Is there a way to use the win32clipboard module to store a reference to a file in the windows clipboard in python. My goal is to paste an image in a way that allows transparency. If I drag and drop a…

retrieve intermediate features from a pipeline in Scikit (Python)

I am using a pipeline very similar to the one given in this example : >>> text_clf = Pipeline([(vect, CountVectorizer()), ... (tfidf, TfidfTransformer()), ... …

Any way to do integer division in sympy?

I have a very long expression that I think can be simplified, and I thought sympy would be the perfect way to do it. Unfortunately the formula relies on a couple of integer divides, and I cant find any…

Scrapy LinkExtractor - Limit the number of pages crawled per URL

I am trying to limit the number of crawled pages per URL in a CrawlSpider in Scrapy. I have a list of start_urls and I want to set a limit on the numbers pages are being crawled in each URL. Once the l…

Python Invalid format string [duplicate]

This question already has answers here:Python time formatting different in Windows(3 answers)Closed 9 years ago.I am trying to print the date in the following format using strftime: 06-03-2007 05:40PMI…

Python template safe substitution with the custom double-braces format

I am trying to substitute variables in the format {{var}} with Pythons Template. from string import Templateclass CustomTemplate(Template):delimiter = {{pattern = r\{\{(?:(?P<escaped>\{\{)|(?P…

Emit signal in standard python thread

I have a threaded application where I do have a network thread. The UI-part passes a callback to this thread. The thread is a normal python thread - its NO QThread.Is it possible to emit PyQT Slot with…