I am implementing the tf-idf algorithm in a web application using Python, but it runs extremely slowly. What I basically do is:
1) Create 2 dictionaries:
- First dictionary: key (document id), value (list of all found words (incl. repeated) in doc)
- Second dictionary: key (document id), value (set containing the unique words of the doc)
Now, a user requests the tf-idf results for document d. What I do is:
2) Loop over the unique words of document d (taken from the second dictionary), and for each unique word w get:
2.1) tf score (how many times w appears in d: loop over the list of words of the first dictionary for the document)
2.2) df score (how many docs contain w: loop over the sets of words of all documents (second dictionary) and check whether each one contains w). I am using a set since checking whether a set contains a word seems to be faster than checking a list.
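To make the steps above concrete, here is a minimal sketch of what they amount to. It is simplified and not my exact code: the toy corpus and the names `raw_docs`, `doc_words`, `doc_vocab` and `tf_df_for_document` are placeholders made up for illustration, and it only produces the raw tf and df counts described in 2.1 and 2.2.

```python
# Hypothetical toy corpus; the real documents come from the web application.
raw_docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat",
    "d3": "a cat and a dog",
}

# Step 1: the two dictionaries, both keyed by document id.
# doc_words keeps every word (including repeats); doc_vocab keeps unique words.
doc_words = {doc_id: text.split() for doc_id, text in raw_docs.items()}
doc_vocab = {doc_id: set(words) for doc_id, words in doc_words.items()}


def tf_df_for_document(doc_id):
    """Return {word: (tf, df)} for one document, following step 2."""
    scores = {}
    for w in doc_vocab[doc_id]:
        # Step 2.1: scan the document's full word list to count w.
        tf = sum(1 for token in doc_words[doc_id] if token == w)
        # Step 2.2: scan every document's set of unique words for w.
        df = sum(1 for vocab in doc_vocab.values() if w in vocab)
        scores[w] = (tf, df)
    return scores


scores = tf_df_for_document("d1")
```

Written this way, every request repeats the scan over all documents once per unique word of d, so the work for one document is roughly (unique words in d) x (number of documents) set lookups, plus a full pass over the word list for each tf count.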
Step 2.2 is terribly slow. For example, with 1000 documents, it takes around 5 minutes to output the results for a document with 2313 unique words.
Is there a way to make step 2.2 faster? Are dictionaries really that slow to iterate over?