Compare 2 Excel files and output an Excel file with differences

2024/9/22 8:34:52

Assume for simplicity that the data files look like this, sorted on ID:

ID  | Data1  | Data2  | Data3   | Data4
199 |  Tim   |   55   |  work  |  $55
345 |  Joe   |   45   |  work  |  $34
356 |  Sam   |   23   |  uni   |  $12

Each file has more than 100,000 rows and about 50 columns. I want to compare a 2nd file with the first for new records (new ID ), edits (IDs match but columns 2 or 4 have changed (Data1 and Data3), and Deletes (ID in first file does not exist in the 2nd file).

Output is to appear in an Excel file with the first column containing D, E or N (for Delete, Edit and New), and the rest of the columns being the same as the columns in the files being compared.

For new records the full new record is to appear in the output file. For Edits both the records are to appear in the output file, but only those fields that have changed are to appear. For deleted records the full old record is to appear in the output file.

I would also like the following output to the screen as the files are being processed:

Deletes: D: 199, Tim
Edits:   E: 345, Joe -> JohnE: 345, work -> xxx
New:     N: 999, Ami

Thanks.

Answer

I suggest you read some of the excellent introductions to pandas to understand how and why this works and to adapt it to your specific needs

Reading the excel-files

pandas.read_excel

import pandas as pdfilename1 = 'filename1.xlsx'
filename2 = 'filename2.xlsx'df1 = pd.read_excel(filename1, index_col=0)
df2 = pd.read_excel(filename2, index_col=0)

df1 and df2 should be pandas.DataFrames with ID as index and the first row as columns or headers

Merging the files

pandas.merge

df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False, indicator=True)

Selecting the changes

id_new = df_merged.index[df_merged['_merge'] == 'right_only'] 
id_deleted = df_merged.index[df_merged['_merge'] == 'left_only'] 
id_changed_data1 = df_merged.index[(df_merged['_merge'] == 'both') & (df_merged['Data1_x'] != df_merged['Data1_y'])]
id_changed_data3 = df_merged.index[(df_merged['_merge'] == 'both') & (df_merged['Data3_x'] != df_merged['Data3_y'])]

This gives you lists (or an Index rather) of the changes, which you can format as you want

https://en.xdnf.cn/q/119157.html

Related Q&A

Problem to show data dynamically table in python flask

I want to show a table in the html using python flask framework. I have two array. One for column heading and another for data record. The length of the column heading and data record are dynamic. I ca…

RuntimeError: generator raised StopIteration

I am in a course and try to find my problem. I cant understand why if I enter something other than 9 digits, the if should raise the StopIteration and then I want it to go to except and print it out. W…

split list elements into sub-elements in pandas dataframe

I have a dataframe as:-Filtered_data[defence possessed russia china,factors driving china modernise] [force bolster pentagon,strike capabilities pentagon congress detailing china] [missiles warheads, d…

Image does not display on Pyqt [duplicate]

This question already has an answer here:Why Icon and images are not shown when I execute Python QT5 code?(1 answer)Closed 2 years ago.I am using Pyqt5, python3.9, and windows 11. I am trying to add a…

Expand the following dictionary into following list

how to generate the following list from the following dictionary d = {2: 4, 3: 1, 5: 3}f = [2**1,2**2, 2**3, 2**4, 3**1, 5**1, 5**2, 5**3, 2**1 * 3, 2**2 * 3, 2**3 * 3, 2**4 * 3, 5**1 * 3, 5**2 * 3, 5*…

how can I improve the accuracy rate of the below trained model using CNN

I have trained a model using python detect the colors of the gemstone and have built a CNN.Herewith Iam attaching the code of mine.(Referred https://www.kaggle.com) import os import matplotlib.pyplot a…

Classes and methods, with lists in Python

I have two classes, called "Pussa" and "Cat". The Pussa has an int atribute idPussa, and the Cat class has two atributes, a list of "Pussa" and an int catNum. Every class …

polars dataframe TypeError: must be real number, not str

so bascially i changed panda.frame to polars.frame for better speed in yolov5 but when i run the code, it works fine till some point (i dont exactly know when error occurs) and it gives me TypeError: m…

Get a string in Shell/Python using sys.argv

Im beginning with bash and Im executing a script :$ ./readtext.sh ./InputFiles/applications.txt Here is my readtext.sh code :#!/bin/bash filename="$1" counter=1 while IFS=: true; doline=read …

Need to combine two functions into one (Python)

Here is my code-def Max(lst):if len(lst) == 1:return lst[0]else:m = Max(lst[1:])if m > lst[0]: return melse:return lst[0] def Min(lst):if len(lst) == 1:return lst[0]else:m = Min(lst[1:])if m < ls…