I have two large files. Their contents look like this:
134430513
125296589
151963957
125296589
Each file contains an unsorted list of ids. Some ids may appear more than once in a single file.
Now I want to find the intersection of the two files, that is, the ids that appear in both files.
I just read the two files into two sets, s1 and s2, and get the intersection with s1.intersection(s2). But it consumes a lot of memory and seems slow.
So is there a better or more Pythonic way to do this? If the files contain so many ids that they cannot be read into a set with limited memory, what can I do?
EDIT: I read the files into two sets using a generator:
    def id_gen(path):
        for line in open(path):
            tmp = line.split()
            yield int(tmp[0])

    c1 = id_gen(path)
    s1 = set(c1)
All of the ids are numeric, and the max id may be 5000000000. If I use a bitarray, it will consume more memory.
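To put a rough number on the bitarray concern, here is a back-of-the-envelope sketch. It assumes one bit per possible id from 0 up to the max id, and 64-bit CPython for the set estimate. The names and the per-element set cost are my own estimate (a lower bound), not measured values, so whether the bit array really costs more depends on how many distinct ids the files hold.

```python
import sys

MAX_ID = 5_000_000_000  # largest id mentioned above

# One bit per possible id (0..MAX_ID), rounded up to whole bytes.
bitarray_bytes = (MAX_ID + 1 + 7) // 8

# Rough lower bound per element of a set of ints on 64-bit CPython:
# the int object itself plus one hash-table slot (key pointer + hash).
# Assumption: real usage is higher because the table is kept partly empty.
int_bytes = sys.getsizeof(MAX_ID)
slot_bytes = 16
set_bytes_per_id = int_bytes + slot_bytes

# Number of distinct ids at which a set reaches the size of the bit array.
break_even_ids = bitarray_bytes // set_bytes_per_id

print(f"bit array: {bitarray_bytes / 2**20:.0f} MiB")
print(f"set reaches that size at roughly {break_even_ids:,} distinct ids")
```

Under these assumptions the bit array has a fixed cost regardless of file size, while the set grows with the number of distinct ids.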