Get differences between two excel files

2024/11/24 2:44:27

Problem Summary

Given 2 excel files, each with 200 columns approx, and have a common index column - ie each row in both files would have a name property say, what would be the best to generate an output excel file which just has the differences from excel file 2 to excel file 1. The differences would be defined as any new rows in file 2 not in file1, and rows in file2 that have the same index (name), but one or more of the other columns are different. There is a good example here using pandas that could be useful : Compare 2 Excel files and output an Excel file with differences Difficult to apply that solution to an excel file with 200 columns though.

Sample Files

Below is a sample of 2 simplified (columns reduced from 200 to 4) excel files in csv format, index column is Name.

Name,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Perth,TimName,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie

So given the above 2 input files, the output file should be :

Name,value,location,Name Copy
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie

So the output file would have 2 rows (not including column title row), rows 2 is a new row not in file1, and row 1 contains changes from file1 to file2.

The following works, but the index column is lost (it's [1, 2] instead of ['Tim', 'Melanie'] :

import pandas as pd
df1 = pd.read_excel('simple1.xlsx', index_col=0)
df2 = pd.read_excel('simple2.xlsx', index_col=0)df3 = pd.merge(df1, df2, how='right', sort='False', indicator='Indicator')
df4 = df3.loc[df3['Indicator'] == 'right_only']
df5 = df4.drop('Indicator', axis=1)writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df5.to_excel(writer, sheet_name='Sheet1')
writer.save()
Answer

The solution was to use numpy.array_equal to determine if rows were equal or not :

import sys
import pandas as pd
import numpy as np# Check for correct number of input arguments
if len(sys.argv) != 4:print('Usage :\n\tpython {} old_excel_file new_excel_file  output_excel_file\n'.format(sys.argv[0]))
quit()# Import input files into dataframes
old_file = sys.argv[1]
new_file = sys.argv[2]
out_file = sys.argv[3]
df1 = pd.read_excel(old_file, index_col=0)
df2 = pd.read_excel(new_file, index_col=0)# Merge dataframes, maintaining index 
df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False, indicator='Indicator')# Add right-only rows to output dataframe
right_only_index = df_merged.index[df_merged['Indicator'] == 'right_only']
df_out = df2.loc[right_only_index]# Iterate through "both" rows, and append ones that are not equal to the output dataframe
both_index = df_merged.index[df_merged['Indicator'] == 'both']
df_both = df2.loc[both_index]for i, values in df_both.iterrows():if not np.array_equal(df1.loc[i].values, df2.loc[i].values):df_out = df_out.append(df2.loc[i])# Write output dataframe to an Excel file (first the two header rows, and then the data rows)
writer = pd.ExcelWriter(out_file, engine='xlsxwriter')
df_out.to_excel(writer, sheet_name='Sheet1')
writer.save()
https://en.xdnf.cn/q/120583.html

Related Q&A

Most pythonic way to call a list of functions [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.Want to improve this question? Update the question so it focuses on one problem only by editing this post.Closed 4…

Unable to get wanted output using for and range functions

For my homework assignment, I am using the for loop and the range function. I have to create a loop that printsHello 0 Hello 1 Hello 3 Hello 6 Hello 10The question says that the number corresponds to t…

i want to add a new field to an existing module odoo11 but i dont know why it didnt work

product_template.xmlthe view to add the customizing field to the product module<?xml version="1.0" encoding="utf-8"?><odoo><data><record id="product_temp…

How to check with more RegEx for one address in python using re.findall()

How to check with more RegEx for one address in python using re.findall()Ex: I want to apply the below regex rules # need to get addresstxt = "hello user 44 West 22nd Street, New York, NY 12345 fr…

Finding all roots of an equation in Python

I have a function that I want to find its roots. I could write a program to figure out its roots but the point is, each time that I want to find the other root I should give it an initial value manuall…

defining matrix class in python

Define a class that abstracts the matrix that satisfies the following examples of practice

Why do round() and math.ceil() give different values for negative numbers in Python 3? [duplicate]

This question already has an answer here:What is the algorithmic difference between math.ceil() and round() when trailing decimal points are >= 0.5 in Python 3?(1 answer)Closed 6 years ago.Why do r…

getting the value from text file after the colon or before the colon in python

I only need the last number 25 so i can change it to int(). I have this text file:{"members": [{"name": "John", "location": "QC", "age": 25}…

How to remove the background of an object using OpenCV (Python) [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.Want to improve this question? Add details and clarify the problem by editing this post.Closed 3 years ago.Improve…

How to move zeros to the end of a list [closed]

Closed. This question needs debugging details. It is not currently accepting answers.Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to repro…