Most efficient way to index words in a document?

2024/10/3 23:19:54

This came up in another question but I figured it is best to ask this as a separate question. Give a large list of sentences (order of 100 thousands):

[
"This is sentence 1 as an example",
"This is sentence 1 as another example",
"This is sentence 2",
"This is sentence 3 as another example ",
"This is sentence 4"
]

what is the best way to code the following function?

def GetSentences(word1, word2, position):return ""

where given two words, word1, word2 and a position position, the function should return the list of all sentences satisfying that constraint. For example:

GetSentences("sentence", "another", 3)

should return sentences 1 and 3 as the index of the sentences. My current approach was using a dictionary like this:

Index = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: [])))for sentenceIndex, sentence in enumerate(sentences):words = sentence.split()for index, word in enumerate(words):for i, word2 in enumerate(words[index:):Index[word][word2][i+1].append(sentenceIndex)

But this quickly blows everything out of proportion on a dataset that is about 130 MB in size as my 48GB RAM is exhausted in less than 5 minutes. I somehow get a feeling this is a common problem but can't find any references on how to solve this efficiently. Any suggestions on how to approach this?

Answer

Use database for storing values.

  1. First add all the sentences to one table (they should have IDs). You may call it eg. sentences.
  2. Second, create table with words contained within all the sentences (call it eg. words, give each word an ID), saving connection between sentences' table records and words' table records within separate table (call it eg. sentences_words, it should have two columns, preferably word_id and sentence_id).
  3. When searching for sentences containing all the mentioned words, your job will be simplified:

    1. You should first find records from words table, where words are exactly the ones you search for. The query could look like this:

      SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');
      
    2. Second, you should find sentence_id values from table sentences that have required word_id values (corresponding to the words from words table). The initial query could look like this:

      SELECT `sentence_id`, `word_id` FROM `sentences_words`
      WHERE `word_id` IN ([here goes list of words' ids]);
      

      which could be simplified to this:

      SELECT `sentence_id`, `word_id` FROM `sentences_words`
      WHERE `word_id` IN (SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
      );
      
    3. Filter the result within Python to return only sentence_id values that have all the required word_id IDs you need.

This is basically a solution based on storing big amount of data in the form that is best suited for this - the database.

EDIT:

  1. If you will only search for two words, you can do even more (almost everything) on DBMS' side.
  2. Considering you need also position difference, you should store the position of the word within third column of sentences_words table (lets call it just position) and when searching for appropriate words, you should calculate difference of this value associated with both words.
https://en.xdnf.cn/q/70621.html

Related Q&A

python libclang bindings on Windows fail to initialize a translation unit from sublime text

Short description: using libclang to autocomplete code does not work with python that comes bundled with Sublime Text 3.Details: A small verifiable example is in the repo on GithubIn essence, there is …

How to create a simple Gradient Descent algorithm

Im studying simple machine learning algorithms, beginning with a simple gradient descent, but Ive got some trouble trying to implement it in python. Here is the example Im trying to reproduce, Ive got …

login_required decorator on a class based view in django

I have a working class based view. But when adding @login_required I get the error:AttributeError: function object has no attribute as_viewSomething is happening to the ResultListView here:from django.…

Generic way to get primary key from declaratively defined instance in SQLAlchemy

Does SQLAlchemy offer a generic way to get the primary key from a declaratively defined instance, so that if:Base = declarative_base()class MyClass(Base):__tablename__ = mytablekey = Column(Integer, pr…

Add column after another column

How can I add a column after another column to a database using Alembic or SQLAlchemy? That would be equivalent to this SQL clause: ALTER TABLE foo CHANGE COLUMN bar bar COLUMN_DEFINITION_HERE AFTER …

Scrapy Images Downloading

My spider runs without displaying any errors but the images are not stored in the folder here are my scrapy files:Spider.py:import scrapy import re import os import urlparse from scrapy.spiders import …

A full and minimal example for a class (not method) with Python C Extension?

Everywhere, I can easily find an example about writing a method with Python C Extensions and use it in Python. Like this one: Python 3 extension example$ python3 >>> import hello >>> …

Python: Grouping into timeslots (minutes) for days of data

I have a list of events that occur at mS accurate intervals, that spans a few days. I want to cluster all the events that occur in a per-n-minutes slot (can be twenty events, can be no events). I have …

signal.alarm not triggering exception on time

Ive slightly modified the signal example from the official docs (bottom of page).Im calling sleep 10 but I would like an alarm to be raised after 1 second. When I run the following snippet it takes way…

Execute Python (selenium) script in crontab

I have read most of the python/cron here in stackoverflow and yet couldnt make my script run. I understood that I need to run my script through shell (using zsh & ipython by the way), but really I …