Fastest algorithm for finding overlap between two very large lists?

2024/9/17 3:20:37

I'm trying to build an algorithm in Python to filter a large block of RDF data.

I have one list consisting of about 70 thousand items formatted like <"datum">.

I then have about 6GB worth of items (triples) formatted like <"A"> <"B"> <"C">

I want to extract all the triples that contain any item in the first list, and then extract any triples that contain any individual item from the first extraction (the net effect is to form a partition of the graph that's connected by one step to the seeds from the first list).

I haven't been able to come up with a great algorithm for this (not helped by the fact that I have no formal CS training.)

The best I've come up with so far is to start by splitting the triples in the big list into a list of three-item lists [<"A">, <"B">, <"C">]. I then split that into chunks and use multiprocessing to create processes that each take the full small list and one chunk of the big list and run:

for line in big_list:
    for item in small_list:
        if item in line:
            bucket.append(line)

This algorithm takes quite a while.

Is there any faster way to do this? If there's a specific algorithm, you can just give me the name and I'll figure out how to implement it.

Thanks!

Clarifications per comments:

  1. All the data items are strings. So the small list might contain ["Mickey", "Mouse", "Minny", "Cat"] and the big list might be [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"], ...]

  2. Only one item in each big-list triple needs to match an item from the small list for that triple to count.

  3. All of the items in the small list are unique, so I hadn't thought converting them to a set would help. But I will try that.

  4. I can create whatever intermediate structures I want. I'm experimenting with an inverted index constructed using a shelve right now.
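Since the question mentions experimenting with an inverted index, here is a minimal in-memory sketch of that idea, using the hypothetical sample data from clarification 1 (item names and variable names are illustrative, not from the actual dataset):

```python
from collections import defaultdict

# Hypothetical sample data: each triple is a list of three strings.
triples = [["Mickey", "Pluto", "Bluto"],
           ["John", "Jane", "Jim"]]

# Build an inverted index mapping each item to the set of triple
# indices that contain it. Built once, reused for every lookup.
index = defaultdict(set)
for i, triple in enumerate(triples):
    for item in triple:
        index[item].add(i)

# Find every triple containing any seed item by iterating over the
# seeds instead of scanning every triple.
seeds = ["Mickey", "Mouse", "Minny", "Cat"]
matched = {i for item in seeds for i in index.get(item, ())}
bucket = [triples[i] for i in sorted(matched)]
```

With an index like this, each lookup costs time proportional to the number of matches rather than the size of the big list, which is the main reason to build an intermediate structure at all. The same layout maps directly onto a disk-backed store such as shelve when the index is too large for memory.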

Answer

You should probably first store the small list in a set, so membership lookup is O(1) on average. This avoids scanning all 70,000 items of the small list for every line in big_list.

small_list_set = set(small_list)
for line in big_list:
    for item in line:
        if item in small_list_set:
            bucket.append(line)
            break  # avoid appending the same line more than once
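Extending that idea to the full two-step extraction described in the question (seeds, then everything one hop away), here is a sketch using set operations; the sample data is hypothetical:

```python
# Hypothetical sample data.
small_list = ["Mickey", "Mouse", "Minny", "Cat"]
big_list = [["Mickey", "Pluto", "Bluto"],
            ["John", "Jane", "Jim"],
            ["Pluto", "Goofy", "Donald"]]

seeds = set(small_list)

# Pass 1: triples containing any seed item. isdisjoint short-circuits
# on the first match, so each triple is checked in O(len(triple)).
first = [t for t in big_list if not seeds.isdisjoint(t)]

# Collect every item seen in the first extraction.
frontier = {item for t in first for item in t}

# Pass 2: triples containing any item from the first extraction.
second = [t for t in big_list if not frontier.isdisjoint(t)]
```

Both passes are a single linear scan over the big list, so for 6GB of data this can be done streaming line by line from disk without loading everything into memory at once.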
