Question 1

Problem Summary

Given two Excel files, each with approximately 200 columns and a common index column (i.e. each row in both files has a name property, say), what would be the best way to generate an output Excel file that contains just the differences from Excel file 2 to Excel file 1? The differences are defined as any new rows in file 2 that are not in file 1, and rows in file 2 that have the same index (name) but differ in one or more of the other columns. There is a good example using pandas that could be useful: "Compare 2 Excel files and output an Excel file with differences". That solution is difficult to apply to an Excel file with 200 columns, though.

Sample Files

Below are two simplified sample Excel files (columns reduced from 200 to 4), shown in CSV format. The index column is Name.

File 1:

Name,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Perth,Tim

File 2:

Name,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie

So given the above two input files, the output file should be:

Name,value,location,Name Copy
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie

So the output file has 2 rows (not including the column title row): row 1 contains the changes from file 1 to file 2, and row 2 is a new row that is not in file 1.

The following works, but the index column is lost (it's [1, 2] instead of ['Tim', 'Melanie']):

import pandas as pd

df1 = pd.read_excel('simple1.xlsx', index_col=0)
df2 = pd.read_excel('simple2.xlsx', index_col=0)

# Right merge on the shared columns; rows found only in file 2 are marked 'right_only'
df3 = pd.merge(df1, df2, how='right', sort=False, indicator='Indicator')
df4 = df3.loc[df3['Indicator'] == 'right_only']
df5 = df4.drop('Indicator', axis=1)

writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df5.to_excel(writer, sheet_name='Sheet1')
writer.close()  # writer.save() was removed in pandas 2.0
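The index is lost above because the merge joins on the data columns and builds a fresh integer index for the result. A minimal sketch of one way around that, assuming the index has a name (as it does when read with index_col=0 from a sheet whose first header is Name) and that both files share the same columns: turn the index into an ordinary column before merging, then restore it afterwards. The helper name diff_keep_index and the in-memory sample frames are hypothetical and only stand in for the two Excel files.

```python
import pandas as pd


def diff_keep_index(df1, df2):
    """Return rows of df2 that are new or changed relative to df1, keeping df2's index.

    Assumes both frames have a named index and identical column names.
    """
    index_name = df2.index.name
    # reset_index() makes the index a normal column, so it takes part in the merge
    merged = pd.merge(df1.reset_index(), df2.reset_index(),
                      how='right', indicator='Indicator')
    # 'right_only' marks rows of df2 with no exact match in df1
    changed = merged.loc[merged['Indicator'] == 'right_only']
    return changed.drop(columns='Indicator').set_index(index_name)


# Sample data mirroring the two simplified files from the question
old = pd.DataFrame(
    {'value': [400, 500], 'location': ['Sydney', 'Perth'], 'Name Copy': ['Bob', 'Tim']},
    index=pd.Index(['Bob', 'Tim'], name='Name'))
new = pd.DataFrame(
    {'value': [400, 500, 600], 'location': ['Sydney', 'Adelaide', 'Brisbane'],
     'Name Copy': ['Bob', 'Tim', 'Melanie']},
    index=pd.Index(['Bob', 'Tim', 'Melanie'], name='Name'))

result = diff_keep_index(old, new)
print(result)
```

Because the merge compares every shared column at once, this does not need the 200 columns to be listed by name.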

Question 2

The solution was to use numpy.array_equal to determine whether rows were equal:

import sys

import numpy as np
import pandas as pd

# Check for correct number of input arguments
if len(sys.argv) != 4:
    print('Usage :\n\tpython {} old_excel_file new_excel_file output_excel_file\n'.format(sys.argv[0]))
    sys.exit(1)

# Import input files into dataframes
old_file = sys.argv[1]
new_file = sys.argv[2]
out_file = sys.argv[3]
df1 = pd.read_excel(old_file, index_col=0)
df2 = pd.read_excel(new_file, index_col=0)

# Merge dataframes, maintaining index
df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False, indicator='Indicator')

# Add right-only rows to output dataframe
right_only_index = df_merged.index[df_merged['Indicator'] == 'right_only']
df_out = df2.loc[right_only_index]

# Iterate through "both" rows, and append ones that are not equal to the output dataframe
both_index = df_merged.index[df_merged['Indicator'] == 'both']
df_both = df2.loc[both_index]
for i, values in df_both.iterrows():
    if not np.array_equal(df1.loc[i].values, df2.loc[i].values):
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        df_out = pd.concat([df_out, df2.loc[[i]]])

# Write output dataframe to an Excel file
writer = pd.ExcelWriter(out_file, engine='xlsxwriter')
df_out.to_excel(writer, sheet_name='Sheet1')
writer.close()  # writer.save() was removed in pandas 2.0
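For wide files, the row-by-row loop can probably be replaced by a vectorized comparison. The following is only a sketch, assuming the index values are unique in each file and that file 1 has the same columns as file 2. The function name rows_new_or_changed is hypothetical. It also treats two empty cells (NaN) in the same position as equal, whereas a plain equality check may report such rows as different because NaN does not compare equal to itself.

```python
import pandas as pd


def rows_new_or_changed(df1, df2):
    """Return rows of df2 that are absent from df1 or differ in at least one column.

    Assumes unique index values and matching column names in both frames.
    """
    # Line df1 up with df2's rows and columns; rows missing from df1 become all-NaN
    aligned_old = df1.reindex(index=df2.index, columns=df2.columns)
    # Cells match if the values are equal or if both cells are empty (NaN)
    same = (aligned_old == df2) | (aligned_old.isna() & df2.isna())
    unchanged = same.all(axis=1)
    # Rows whose index label does not exist in df1 are always kept
    is_new = ~df2.index.isin(df1.index)
    keep = ~unchanged | is_new
    return df2.loc[keep]


# Sample data mirroring the two simplified files from the question
old = pd.DataFrame(
    {'value': [400, 500], 'location': ['Sydney', 'Perth'], 'Name Copy': ['Bob', 'Tim']},
    index=pd.Index(['Bob', 'Tim'], name='Name'))
new = pd.DataFrame(
    {'value': [400, 500, 600], 'location': ['Sydney', 'Adelaide', 'Brisbane'],
     'Name Copy': ['Bob', 'Tim', 'Melanie']},
    index=pd.Index(['Bob', 'Tim', 'Melanie'], name='Name'))

diff = rows_new_or_changed(old, new)
print(diff)
```

Unlike the loop above, this keeps the rows in the order they appear in file 2, so the output matches the expected file in the question (Tim first, then Melanie).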

Get differences between two excel files
