I'm on a Linux machine (Red Hat) and I have an 11GB text file. Each line in the text file contains data for a single record, and the first n characters of the line contain a unique identifier for the record. The file contains a little over 27 million records.
I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80GB text file, so any solution that requires loading the entire file into memory would not be practical.
Read the file line-by-line, so you don't have to load it all into memory.
For each line (record), compute a SHA-256 hash (32 bytes), unless your identifier is shorter than that, in which case you can just use the identifier directly.
Store the hashes/identifiers in a numpy.array; that is probably the most compact way to store them. 27 million records times 32 bytes per hash is 864 MB. That should fit into the memory of a decent machine these days.
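As a minimal sketch of that variant (the file name and the record count are placeholders; count the lines first if you don't know the total), you could preallocate the array, fill it while streaming the file, and then sort it so that any duplicates end up adjacent:

import hashlib
import numpy as np

NUM_RECORDS = 27_000_000  # assumed known up front

hashes = np.empty(NUM_RECORDS, dtype='S32')  # 32 raw bytes per SHA-256 digest
with open('bigdata.txt', 'rb') as datafile:  # binary mode: hashlib wants bytes
    for i, line in enumerate(datafile):
        hashes[i] = hashlib.sha256(line).digest()

hashes.sort()  # after sorting, identical digests are neighbors
if np.any(hashes[:-1] == hashes[1:]):
    print("duplicate records found")

(One caveat: numpy's 'S' dtype ignores trailing null bytes when comparing, but two SHA-256 digests differing only in trailing zeros is astronomically unlikely, so this is harmless in practice.)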
To speed up access, you could use e.g. the first 2 bytes of the hash as the key of a collections.defaultdict and append the remainder of each hash to the list stored as the value. This would in effect create a hash table with 65536 buckets. For 27e6 records, each bucket would contain a list of around 400 entries on average.
It would mean faster searching than a numpy array, but it would use more memory.
import collections
import hashlib

d = collections.defaultdict(list)
with open('bigdata.txt', 'rb') as datafile:  # binary mode: hashlib wants bytes
    for line in datafile:
        digest = hashlib.sha256(line).digest()
        # Or digest = line[:n], if the identifier itself is short enough
        k = digest[0:2]  # first 2 bytes select one of 65536 buckets
        v = digest[2:]   # the rest is stored in the bucket's list
        if v in d[k]:
            print("double found:", digest)
        else:
            d[k].append(v)
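Note that the membership test v in d[k] scans the bucket's list linearly; if the buckets grow large, you could use defaultdict(set) and d[k].add(v) instead, trading a little extra memory for constant-time lookups (the bytes digests are already hashable, so they work as set entries).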