Problem Summary
Given 2 excel files, each with 200 columns approx, and have a common index column - ie each row in both files would have a name property say, what would be the best to generate an output excel file which just has the differences from excel file 2 to excel file 1. The differences would be defined as any new rows in file 2 not in file1, and rows in file2 that have the same index (name), but one or more of the other columns are different. There is a good example here using pandas that could be useful : Compare 2 Excel files and output an Excel file with differences Difficult to apply that solution to an excel file with 200 columns though.
Sample Files
Below is a sample of 2 simplified (columns reduced from 200 to 4) excel files in csv format, index column is Name.
Name,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Perth,TimName,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie
So given the above 2 input files, the output file should be :
Name,value,location,Name Copy
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie
So the output file would have 2 rows (not including column title row), rows 2 is a new row not in file1, and row 1 contains changes from file1 to file2.
The following works, but the index column is lost (it's [1, 2] instead of ['Tim', 'Melanie'] :
import pandas as pd
df1 = pd.read_excel('simple1.xlsx', index_col=0)
df2 = pd.read_excel('simple2.xlsx', index_col=0)df3 = pd.merge(df1, df2, how='right', sort='False', indicator='Indicator')
df4 = df3.loc[df3['Indicator'] == 'right_only']
df5 = df4.drop('Indicator', axis=1)writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df5.to_excel(writer, sheet_name='Sheet1')
writer.save()