Fastest algorithm for finding overlap between two very large lists?

2024/9/17 3:20:37

I'm trying to build an algorithm in Python to filter a large block of RDF data.

I have one list consisting of about 70 thousand items formatted like <"datum">.

I then have about 6GB worth of items (triples) formatted like <"A"> <"B"> <"C">

I want to extract all the triples that contain any item in the first list, and then extract any triples that contain any individual item from the first extraction (the net effect is to form a partition of the graph that's connected by one step to the seeds from the first list).

I haven't been able to come up with a great algorithm for this (not helped by the fact that I have no formal CS training.)

The best I've come up with so far is to start by splitting the triples in the big list into a list of three-item lists [<"A">, <"B">, <"C">]. I then split that into chunks and use multiprocessing to create processes that each take the full small list and one chunk of the big list and run:

for line in big_list:
    for item in small_list:
        if item in line:
            bucket.append(line)

This algorithm takes quite a while.

Is there any faster way to do this? If there's a specific algorithm, you can just give me the name and I'll figure out how to implement it.

Thanks!

Clarifications per comments:

  1. All the data items are strings. So the small list might contain ["Mickey", "Mouse", "Minny", "Cat"] and the big list might be [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"], ...]

  2. Only one item in each big-list triple needs to match an item from the small list for that triple to count.

  3. All of the items in the small list are unique, so I hadn't thought converting them to a set would help. But I will try that.

  4. I can create whatever intermediate structures I want. I'm experimenting with an inverted index constructed using a shelve right now.
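Since the question mentions experimenting with an inverted index, here is a minimal in-memory sketch of that idea, using the hypothetical sample data from clarification 1 (item names and variable names are illustrative, not from the actual dataset):

```python
from collections import defaultdict

# Hypothetical sample data: each triple is a list of three strings.
triples = [["Mickey", "Pluto", "Bluto"],
           ["John", "Jane", "Jim"]]

# Build an inverted index mapping each item to the set of triple
# indices that contain it. Built once, reused for every lookup.
index = defaultdict(set)
for i, triple in enumerate(triples):
    for item in triple:
        index[item].add(i)

# Find every triple containing any seed item by iterating over the
# seeds instead of scanning every triple.
seeds = ["Mickey", "Mouse", "Minny", "Cat"]
matched = {i for item in seeds for i in index.get(item, ())}
bucket = [triples[i] for i in sorted(matched)]
```

With an index like this, each lookup costs time proportional to the number of matches rather than the size of the big list, which is the main reason to build an intermediate structure at all. The same layout maps directly onto a disk-backed store such as shelve when the index is too large for memory.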

Answer

You should probably first store the small list in a set, so membership lookup is O(1) on average. This avoids scanning all 70,000 items of the small list for every line in big_list.

small_list_set = set(small_list)
for line in big_list:
    for item in line:
        if item in small_list_set:
            bucket.append(line)
            break  # avoid appending the same line more than once
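Extending that idea to the full two-step extraction described in the question (seeds, then everything one hop away), here is a sketch using set operations; the sample data is hypothetical:

```python
# Hypothetical sample data.
small_list = ["Mickey", "Mouse", "Minny", "Cat"]
big_list = [["Mickey", "Pluto", "Bluto"],
            ["John", "Jane", "Jim"],
            ["Pluto", "Goofy", "Donald"]]

seeds = set(small_list)

# Pass 1: triples containing any seed item. isdisjoint short-circuits
# on the first match, so each triple is checked in O(len(triple)).
first = [t for t in big_list if not seeds.isdisjoint(t)]

# Collect every item seen in the first extraction.
frontier = {item for t in first for item in t}

# Pass 2: triples containing any item from the first extraction.
second = [t for t in big_list if not frontier.isdisjoint(t)]
```

Both passes are a single linear scan over the big list, so for 6GB of data this can be done streaming line by line from disk without loading everything into memory at once.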
