How can I find the intersection of two large files efficiently using Python?

2024/9/19 17:03:14

I have two large files. Their contents look like this:

134430513
125296589
151963957
125296589

Each file contains an unsorted list of ids. Some ids may appear more than once in a single file.

Now I want to find the intersection of the two files, that is, the ids that appear in both files.

I simply read the two files into two sets, s1 and s2, and get the intersection with s1.intersection(s2). But this consumes a lot of memory and seems slow.

Is there a better or more Pythonic way to do this? If a file contains so many ids that it cannot be read into a set with limited memory, what can I do?

EDIT: I read the file into 2 sets using a generator:

def id_gen(path):
    for line in open(path):
        tmp = line.split()
        yield int(tmp[0])

c1 = id_gen(path)
s1 = set(c1)

All of the ids are numeric, and the largest id may be as high as 5000000000. If I use a bitarray, it will consume more memory.
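For scale, here is a rough estimate of what a bitmap over that id range would cost. It assumes one bit per possible id from 0 up to the maximum. The helper name bitmap_bytes is made up for this illustration and is not part of the bitarray package. Whether this is more or less than a set depends on how many distinct ids the files actually hold, since a set grows with the number of ids while the bitmap size is fixed by the largest id.

```python
def bitmap_bytes(max_id):
    """Bytes needed for a bitmap with one bit per id in 0..max_id (hypothetical helper)."""
    n_bits = max_id + 1          # ids 0 through max_id inclusive
    return (n_bits + 7) // 8     # round up to whole bytes


size = bitmap_bytes(5_000_000_000)
print(size, "bytes, about", round(size / 1e6), "MB")
```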

Answer

Others have shown the more idiomatic ways of doing this in Python, but if the data really is too big to fit in memory, you can use the system utilities to sort each file and eliminate duplicates, then rely on the fact that a file object is an iterator that returns one line at a time, doing something like:

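import os

# Sort numerically and drop duplicates with the system sort utility.
os.system('sort -u -n s1.num > s1.ns')
os.system('sort -u -n s2.num > s2.ns')

with open('s1.ns') as i1, open('s2.ns') as i2:
    try:
        # Compare as integers: the files are in numeric order, and
        # comparing the raw lines as strings would order "9" after "10".
        d1 = int(next(i1))
        d2 = int(next(i2))
        while True:
            if d1 < d2:
                d1 = int(next(i1))
            elif d2 < d1:
                d2 = int(next(i2))
            else:
                print(d1)
                d1 = int(next(i1))
                d2 = int(next(i2))
    except StopIteration:
        # One file is exhausted, so no further matches are possible.
        pass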

This avoids having more than one line at a time (for each file) in memory (and the system sort should be faster than anything Python can do, as it is optimized for this one task).
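The merge step can also be sketched as a reusable generator, which makes it easy to try on small inputs before running it on the real files. This is only an illustration: the names sorted_intersection and read_ids are invented here, and the sketch assumes both inputs are already sorted in ascending numeric order with duplicates removed, as sort -u -n would leave them.

```python
def sorted_intersection(a, b):
    """Yield values present in both a and b.

    Assumes both iterables are sorted ascending with no duplicates.
    Only one value from each input is held at a time.
    """
    a, b = iter(a), iter(b)
    try:
        x, y = next(a), next(b)
        while True:
            if x < y:
                x = next(a)          # advance the side that is behind
            elif y < x:
                y = next(b)
            else:
                yield x              # match: emit it and advance both sides
                x, y = next(a), next(b)
    except StopIteration:
        return                       # either input exhausted: no more matches


def read_ids(path):
    """Lazily yield the ids in a file as ints, skipping blank lines (hypothetical helper)."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield int(line)


# Small in-memory demonstration.
print(list(sorted_intersection([1, 3, 5, 9, 10], [3, 4, 9, 10, 12])))  # prints [3, 9, 10]
```

With the sorted files from above, the same function could be called as sorted_intersection(read_ids('s1.ns'), read_ids('s2.ns')). Converting each line to int before comparing matters because the files are in numeric order, not string order.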

