Storing an inverted index

2024/4/14 9:30:13

I am working on an information retrieval project. I have built a full inverted index using Hadoop and Python. Hadoop outputs the index as (word, documentlist) pairs, which are written to a file. For quick access, I build a dictionary (hash table) from that file. My question is: how do I store such an index on disk so that lookups are still fast? At present I am storing the dictionary using the Python pickle module and loading from it, but it brings the whole index into memory at once (or does it?). Please suggest an efficient way of storing and searching through the index.

My dictionary structure is as follows (using nested dictionaries):

{word : {doc1:[locations], doc2:[locations], ....}}

so that I can get the documents containing a word by dictionary[word].keys() ... and so on.
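To make the structure concrete, here is a minimal sketch of that nested dictionary and the lookups I mean. The words, document IDs, positions, and the helper names `docs_containing` and `positions` are made up for illustration; they are not from my actual index.

```python
# Hypothetical sample data: word -> {document ID -> list of positions}.
index = {
    "hadoop": {"doc1": [0, 14], "doc3": [7]},
    "python": {"doc1": [3], "doc2": [1, 9, 22]},
}


def docs_containing(index, word):
    """Return the sorted document IDs that contain `word` (empty list if absent)."""
    return sorted(index.get(word, {}).keys())


def positions(index, word, doc_id):
    """Return the positions of `word` inside `doc_id` (empty list if absent)."""
    return index.get(word, {}).get(doc_id, [])


python_docs = docs_containing(index, "python")
hadoop_in_doc1 = positions(index, "hadoop", "doc1")
```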



At present I am storing the dictionary using the Python pickle module and loading from it, but it brings the whole index into memory at once (or does it?).

Yes, it does bring it all in: unpickling rebuilds the entire dictionary in memory.

Is that a problem? If it's not an actual problem, then stick with it.

If it's a problem, what kind of problem do you have? Too slow? Too fast? Too colorful? Too much memory used? What problem do you have?
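If memory turns out to be the problem, one possible direction (not the only one) is the standard-library shelve module, which pickles each value separately under its key so a lookup only unpickles the entry for that word. The sketch below contrasts it with a single pickle file; the sample data, file names, and temporary directory are assumptions for illustration only.

```python
import os
import pickle
import shelve
import tempfile

# Hypothetical sample data in the question's nested-dictionary layout.
index = {
    "hadoop": {"doc1": [0, 14], "doc3": [7]},
    "python": {"doc1": [3], "doc2": [1, 9, 22]},
}

workdir = tempfile.mkdtemp()

# pickle: the index is one blob, so load() rebuilds the whole dictionary.
pickle_path = os.path.join(workdir, "index.pkl")
with open(pickle_path, "wb") as f:
    pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pickle_path, "rb") as f:
    loaded = pickle.load(f)  # every word and posting list is now in memory

# shelve: each word's postings are stored as a separate pickled value.
shelf_path = os.path.join(workdir, "index_shelf")
with shelve.open(shelf_path) as db:
    for word, postings in index.items():
        db[word] = postings

# Reopen read-only; only the requested word's postings are unpickled.
with shelve.open(shelf_path, flag="r") as db:
    python_docs = sorted(db["python"].keys())
    has_java = "java" in db
```

Whether this is actually faster or smaller than the plain pickle depends on the index size and the query pattern, so measure before switching.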
