Find duplicate records in large text file

2024/10/2 8:39:55

I'm on a linux machine (Redhat) and I have an 11GB text file. Each line in the text file contains data for a single record and the first n characters of the line contains a unique identifier for the record. The file contains a little over 27 million records.

I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80GB text file so any solution that requires loading the entire file into memory would not be practical.

Answer

Read the file line-by-line, so you don't have to load it all into memory.

For each line (record) create a sha256 hash (32 bytes), unless your identifier is shorter.

Store the hashes/identifiers in an numpy.array. That is probably the most compact way to store them. 27 million records times 32 bytes/hash is 864 MB. That should fit into the memory of decent machine these days.

To speed up access you could use the first e.g. 2 bytes of the hash as the key of a collections.defaultdict and put the rest of the hashes in a list in the value. This would in effect create a hash table with 65536 buckets. For 27e6 records, each bucket would contain on average a list of around 400 entries. It would mean faster searching than a numpy array, but it would use more memory.

d = collections.defaultdict(list)
with open('bigdata.txt', 'r') as datafile:for line in datafile:id = hashlib.sha256(line).digest()# Or id = line[:n]k = id[0:2]v = id[2:]if v in d[k]:print "double found:", idelse:d[k].append(v)
https://en.xdnf.cn/q/70872.html

Related Q&A

Select the first item from a drop down by index is not working. Unbound method select_by_index

I am trying to click the first item from a drop down. I want to use its index value because the value could be different each time. I only need to select the 1st item in the drop down for this particul…

finding the app window currently in focus on Mac OSX

I am writing a desktop usage statistics app. It runs a background daemon which wakes up at regular intervals, finds the name of the application window currently in focus and logs that data in database.…

os.system vs subprocess in python on linux

I have two python scripts. The first script calls a table of second scripts in which I need to execute a third party python script. It looks something like this: # the call from the first script. cmd …

How can I call a python script from a python script

I have a python script b.py which prints out time ever 5 sec.while (1):print "Start : %s" % time.ctime()time.sleep( 5 )print "End : %s" % time.ctime()time.sleep( 5 )And in my a.py, …

Read EXE, MSI, and ZIP file metadata in Python in Linux

I am writing a Python script to index a large set of Windows installers into a DB. I would like top know how to read the metadata information (Company, Product Name, Version, etc) from EXE, MSI and ZIP…

IllegalArgumentException thrown when count and collect function in spark

I tried to load a small dataset on local Spark when this exception is thrown when I used count() in PySpark (take() seems working). I tried to search about this issue but got no luck in figuring out wh…

Check if string does not contain strings from the list

I have the following code: mystring = ["reddit", "google"] mylist = ["a", "b", "c", "d"] for mystr in mystring:if any(x not in mystr for x in…

How do I conditionally include a file in a Sphinx toctree? [duplicate]

This question already has answers here:Conditional toctree in Sphinx(4 answers)Closed 8 years ago.I would like to include one of my files in my Sphinx TOC only when a certain tag is set, however the ob…

Use BeautifulSoup to extract sibling nodes between two nodes

Ive got a document like this:<p class="top">I dont want this</p><p>I want this</p> <table><!-- ... --> </table><img ... /><p> and all tha…

Put HTML into ValidationError in Django

I want to put an anchor tag into this ValidationError:Customer.objects.get(email=value)if self.register:# this address is already registeredraise forms.ValidationError(_(An account already exists for t…