How to filter overlapping rows in a big file in Python

2024/10/15 17:23:31

I am trying to filter overlapping rows in a big file in Python. The overlap degree threshold is set to 25%. In other words, for any two rows that remain, the number of elements in their intersection must be less than 0.25 times the number of elements in their union; if the ratio is more than 0.25, one of the two rows is deleted. Suppose I have a big file with 1,000,000 rows in total, and the first 5 rows are as follows:

c6 c24 c32 c54 c67
c6 c24 c32 c51 c68 c78
c6 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
c6 c32 c53 c67

The intersection of the 1st and 2nd rows has 3 elements (c6, c24, c32), and their union has 8 elements (c6, c24, c32, c54, c67, c51, c68, c78). The overlap degree is 3/8 = 0.375 > 0.25, so the 2nd row is deleted. The same goes for the 3rd and 5th rows. The final answer is the 1st and 4th rows:

c6 c24 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
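The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name `overlap_degree` is made up for illustration, and returning 0.0 for two empty rows is an assumption, since the question does not say how blank rows should be handled.

```python
def overlap_degree(row_a, row_b):
    """Return |A & B| / |A | B| for two whitespace-separated rows."""
    a, b = set(row_a.split()), set(row_b.split())
    union = a | b
    if not union:
        # Assumption: two empty rows count as not overlapping.
        return 0.0
    return len(a & b) / len(union)


rows = [
    "c6 c24 c32 c54 c67",
    "c6 c24 c32 c51 c68 c78",
    "c6 c32 c54 c67",
    "c6 c32 c55 c63 c85 c94 c75",
    "c6 c32 c53 c67",
]

# Overlap degree of the 1st row with each of the other four rows.
degrees = [overlap_degree(rows[0], other) for other in rows[1:]]
print(degrees)  # [0.375, 0.8, 0.2, 0.5]
```

Only the 4th row stays at or below 0.25 against the 1st row, which matches the expected result.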

The pseudocode is as follows:

for i = 1:(n-1)    # n is the number of rows in the big file
    for j = (i+1):n
        if the overlap degree of the ith row and the jth row is more than 0.25
            delete the jth row from the big file
        end
    end
end

How can I solve this problem in Python? Thank you!

Answer

The tricky part is that you have to modify the list you're iterating over and still keep track of two indices. One way to do that is to go backwards, since deleting an item with index equal to or larger than the indices you keep track of will not influence them.

This code is untested, but you get the idea:

with open("file.txt") as fileobj:
    sets = [set(line.split()) for line in fileobj]

for first_index in range(len(sets) - 2, -1, -1):
    for second_index in range(len(sets) - 1, first_index, -1):
        union = sets[first_index] | sets[second_index]
        intersection = sets[first_index] & sets[second_index]
        if len(intersection) / float(len(union)) > 0.25:
            del sets[second_index]

with open("output.txt", "w") as fileobj:
    for set_ in sets:
        # order of the set is undefined, so we need to sort each set
        output = " ".join(sorted(set_, key=lambda x: int(x[1:])))
        fileobj.write("{0}\n".format(output))

Since it's obvious how to sort the elements of each line (by the number after the "c"), we can regenerate each line like this. If the order were somehow custom, we'd have to keep the original line alongside each set so that we could write back exactly the line that was read, instead of regenerating it.
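Keeping the original text next to each set could look roughly like the sketch below. It is not the code from the answer: instead of deleting in place while iterating backwards, it walks forward and compares each row only against the rows kept so far, which follows the order of the pseudocode in the question. The two orders may not always keep the same rows, because a backward pass can delete a row on account of a row that is itself deleted later. The function name `filter_overlapping` and the choice to skip blank lines are assumptions made for illustration. Note that the 4th sample row (c6 c32 c55 c63 c85 c94 c75) is not in numeric order, so writing back the original text is the only way to reproduce it exactly.

```python
def filter_overlapping(lines, threshold=0.25):
    """Keep a row only if it does not overlap too much with any earlier kept row."""
    kept = []  # list of (original_text, set_of_elements) pairs
    for line in lines:
        text = line.rstrip("\n")
        elements = set(text.split())
        if not elements:
            continue  # assumption: blank lines are dropped
        for _, kept_elements in kept:
            shared = len(elements & kept_elements)
            total = len(elements | kept_elements)
            if shared / total > threshold:
                break  # overlaps an earlier kept row, so drop this one
        else:
            # No break happened: the row is compatible with every kept row.
            kept.append((text, elements))
    return [text for text, _ in kept]


sample = [
    "c6 c24 c32 c54 c67\n",
    "c6 c24 c32 c51 c68 c78\n",
    "c6 c32 c54 c67\n",
    "c6 c32 c55 c63 c85 c94 c75\n",
    "c6 c32 c53 c67\n",
]
result = filter_overlapping(sample)
print(result)
```

Because an open file object yields its lines one by one, the same function could be called as `filter_overlapping(fileobj)` inside a `with open("file.txt") as fileobj:` block. The work is still quadratic in the worst case (every row compared against every kept row), so a file with 1,000,000 rows may take a long time if most rows survive.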

https://en.xdnf.cn/q/117807.html
