Python and the tf-idf algorithm: how to make it faster?

2024/10/18 13:38:01

I am implementing the tf-idf algorithm in a Python web application, but it runs extremely slowly. What I basically do is:

1) Create 2 dictionaries (a minimal sketch follows this list):

  • First dictionary: key (document id), value (list of all words found in the doc, including repetitions)
  • Second dictionary: key (document id), value (set containing the unique words of the doc)
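
For concreteness, a minimal sketch of these two structures, assuming the documents are plain strings and simple whitespace tokenization (all names here are illustrative, not from the original code):

# Hypothetical corpus: document id -> raw text (illustrative data).
documents = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
}

# First dictionary: doc id -> list of all words (including repetitions).
doc_words = {doc_id: text.lower().split() for doc_id, text in documents.items()}

# Second dictionary: doc id -> set of unique words.
doc_unique_words = {doc_id: set(words) for doc_id, words in doc_words.items()}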

Now, a user requests the tf-idf results for a document d. What I do is:

2) Loop over the unique words of document d in the second dictionary, and for each unique word w compute:

2.1) the tf score (how many times w appears in d: loop over the list of words in the first dictionary for that document)

2.2) the df score (how many docs contain w: loop over the word sets of all documents in the second dictionary and check whether w is contained). I am using a set since checking whether it contains a word is faster than with a list.
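
In code, the query path described above looks roughly like this (a sketch reusing the illustrative doc_words / doc_unique_words structures from the earlier snippet; the tf * log(N/df) weighting is an assumption, since the question does not show how the scores are combined):

import math

def tfidf_for_document(d, doc_words, doc_unique_words):
    # Illustrative version of steps 2, 2.1 and 2.2 as described above.
    n_docs = len(doc_words)
    scores = {}
    for w in doc_unique_words[d]:                  # step 2: every unique word of d
        tf = doc_words[d].count(w)                 # step 2.1: scans the whole word list of d
        df = sum(1 for words in doc_unique_words.values() if w in words)  # step 2.2: scans every document's word set
        # Assumed weighting: the classic tf * log(N / df).
        scores[w] = tf * math.log(n_docs / df)
    return scores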

Step 2.2 is terribly slow. For example, with 1000 documents, a document with 2313 unique words takes around 5 minutes to produce results.

Is there any other way to make step 2.2 faster? Are dictionaries really that slow to iterate over?

Answer

Well, you have to rethink and redesign the way you hold your data; in other words, implement an "orthodox" version of an inverted index.

Your bottleneck is the on-the-fly calculation of the document frequency (DF) for the terms. Instead, keep it up to date dynamically: every time you update your corpus (your collection of documents), do some processing and update the DF of every term in the new document (and, of course, save the results in a persistent way, e.g. a database).

The only structure you need is a nested dictionary like this:

{ "term1" : { "DF" : x, "some_doc_id" : tf , "some_other_doc_id" : tf, etc  } ,"term2" : ...etc..
}

properly updated every time you "feed" your corpus.

And of course, keep your corpus cardinality (the total number of documents) somewhere, since you need it for the idf.
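
A minimal sketch of that design, assuming whitespace tokenization and the classic log(N/df) idf; persistence (database, Redis, etc.) is left out and all names are illustrative:

import math
from collections import Counter

index = {}        # term -> {"DF": document frequency, doc_id: term frequency, ...}
corpus_size = 0   # corpus cardinality, updated every time a document is fed in

def feed_document(doc_id, text):
    # Update the inverted index and the DFs whenever a new document enters the corpus.
    global corpus_size
    corpus_size += 1
    for term, tf in Counter(text.lower().split()).items():
        entry = index.setdefault(term, {"DF": 0})
        entry["DF"] += 1     # one more document contains this term
        entry[doc_id] = tf   # raw term frequency of the term in this document

def tfidf_for_document(doc_id, terms):
    # 'terms' is the set of unique words of the document (the second dictionary
    # from the question); each score is now just a couple of dictionary lookups.
    return {t: index[t][doc_id] * math.log(corpus_size / index[t]["DF"]) for t in terms}

With this layout, step 2.2 disappears: the DF of every term is already stored, so scoring a document costs a handful of dictionary lookups per term instead of a scan over the whole corpus.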

As a hobby, and as part of my work, I am implementing a small Python/Redis-backed search engine; you might get some other ideas from it as well. Take a look here.


Related Q&A

How to use Python to find all isbn in a text file?

I have a text file text_isbn with loads of ISBN in it. I want to write a script to parse it and write it to a new text file with each ISBN number in a new line.Thus far I could write the regular expres…

AWS Batch Job Execution Results in Step Function

Im newbie to AWS Step Functions and AWS Batch. Im trying to integrate AWS Batch Job with Step Function. AWS Batch Job executes simple python scripts which output string value (High level simplified req…

Simplify Django test set up with mock objects

Often when Im writing tests for my Django project, I have to write a lot more code to set up database records than I do to actually test the object under test. Currently, I try to use test fixtures to …

How to use tf.data.Dataset.padded_batch with a nested shape?

I am building a dataset with two tensors of shape [batch,width,heigh,3] and [batch,class] for each element. For simplicity lets say class = 5.What shape do you feed to dataset.padded_batch(1000,shape)…

Python, thread and gobject

I am writing a program by a framework using pygtk. The main program doing the following things:Create a watchdog thread to monitor some resource Create a client to receive data from socket call gobjec…

How to type annotate overrided methods in a subclass?

Say I already have a method with type annotations:class Shape:def area(self) -> float:raise NotImplementedErrorWhich I will then subclass multiple times:class Circle:def area(self) -> float:retur…

Import Error: No module named pytz after using easy_install

Today is my first day at Python and have been going through problems. One that I was working on was, "Write a short program which extracts the current date and time from the operating system and p…

Python catch exception pandas.errors.ParserError: Error tokenizing data. C error

I am facing problem with my malfunction csv input file whole reading and which i can deal with by adding "error_bad_lines=False" in my read_csv func to remove those.But i need to report these…

Nested tags in BeautifulSoup - Python

Ive looked at many examples on websites and on stackoverflow but I couldnt find a universal solution to my question. Im dealing with a really messy website and Id like to scrape some data. The markup l…

How do I check if a string is a negative number before passing it through int()?

Im trying to write something that checks if a string is a number or a negative. If its a number (positive or negative) it will passed through int(). Unfortunately isdigit() wont recognize it as a numbe…