How to diff two files using a Python generator

2024/11/15 17:47:26

I have a 100GB file containing the numbers 1 to 1000000000000, one per line. Some lines are missing, such as 5, 11, 19919, etc. My RAM size is 8GB.

How can I find the missing elements?

My idea is to take another file, for i in range(1,1000000000000), and read the lines one by one using a generator. Can we use the yield statement for this?

Can you help with writing the code?

My code is below. It reads the lines of each file into a list. Can this code be used in production?

def difference(a, b):
    with open(a, 'r') as f:
        aunique = set(f.readlines())
    with open(b, 'r') as f:
        bunique = set(f.readlines())
    with open('c', 'a+') as f:
        for line in list(bunique - aunique):
            f.write(line)
Answer

If the values are in sequential order, you can simply note the previous value and see if the difference equals one:

prev = 0
with open('numbers.txt', 'r') as f:
    for line in f:
        value = int(line.strip())
        for i in range(prev, value-1):
            print('missing:', i+1)
        prev = value
# output numbers that are missing at the end of the file (see comment by @blhsing)
for i in range(prev, 1000000000000):
    print('missing:', i+1)

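This should work fine in Python 3: iterating over the file object with "for line in f" reads one line at a time, so the full file is never loaded or kept in memory. (Calling readlines(), by contrast, returns a list of every line, which is why the code in the question will not fit in 8GB of RAM.)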
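
Since the question asks about yield, the same gap-detection idea could also be wrapped in a generator. The following is only a sketch under the same assumption as above (values are sorted in ascending order, one integer per line); the function name missing_numbers and its parameters are made up for illustration and do not come from the original answer.

```python
def missing_numbers(lines, last):
    """Lazily yield every integer in 1..last that is absent from `lines`.

    `lines` is any iterable of strings holding ascending integers, one per
    item (for example an open file object). Both names are hypothetical.
    """
    prev = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        value = int(line)
        # Everything strictly between the previous value and this one is missing.
        yield from range(prev + 1, value)
        prev = value
    # Values missing after the last line of the input.
    yield from range(prev + 1, last + 1)


# Small in-memory demonstration; a real run would pass an open file instead.
sample = ["1\n", "2\n", "4\n", "7\n"]
print(list(missing_numbers(sample, 10)))  # → [3, 5, 6, 8, 9, 10]
```

With the real file, the generator would be fed the file object directly, for example "with open('numbers.txt') as f:" followed by "for n in missing_numbers(f, 1000000000000): print('missing:', n)". Only one line and one integer are held in memory at a time, so the 8GB limit should not be a concern as long as the result is consumed lazily rather than collected into a list.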

https://en.xdnf.cn/q/71441.html
