Store large dictionary to file in Python

2024/9/23 13:31:00

I have a dictionary with many entries, each with a huge vector as its value. These vectors can have up to 60,000 dimensions, and the dictionary has about 60,000 entries. To save time, I want to store the dictionary after computing it. However, pickling it produced a huge file. I have tried storing it as JSON, but the file remains extremely large (about 10.5 MB for a sample of 50 entries with fewer dimensions). I have also read about sparse matrices. Since most entries will be 0, this is a possibility. Will it reduce the file size? Is there any other way to store this information? Or am I just unlucky?

Update:

Thank you all for the replies. I want to store this data because these are word counts. For example, given a sentence, I store the number of times word 0 (at location 0 in the array) appears in that sentence. There are obviously more words across all sentences than appear in any one sentence, hence the many zeros. I then want to use these arrays to train at least three, maybe six, classifiers. It seemed easier to create the arrays with word counts first and then run the classifiers overnight to train and test. I use sklearn for this. I chose this format to be consistent with other feature vector formats, which is why I am approaching the problem this way. If this is not the way to go in this case, please let me know. I am very much aware that I have much to learn about coding efficiently!

I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).

Update 2: Thank you all for the tips. John Mee was right that I did not need to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up the calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!

Answer

See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file.

Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.

If you are using numpy arrays, storage can be very efficient, as both klepto and joblib understand how to use a minimal state representation for an array. If most elements of your arrays are indeed zeros, then by all means convert to sparse matrices, and you will find huge savings in the storage size of the array.
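To give a rough idea of the savings, here is a minimal sketch using scipy.sparse. It is not taken from the linked answers: the toy data, the variable names, and the choice of CSR format with `save_npz` are all assumptions made for illustration, and the sizes you see will depend on how sparse your real data is.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Hypothetical stand-in for the real data: 200 "sentences" over a
# 5,000-word vocabulary, each sentence containing 20 distinct words.
rng = np.random.default_rng(0)
n_rows, n_cols, words_per_row = 200, 5000, 20
counts = {}
for i in range(n_rows):
    vec = np.zeros(n_cols, dtype=np.int32)
    cols = rng.choice(n_cols, size=words_per_row, replace=False)
    vec[cols] = rng.integers(1, 5, size=words_per_row)
    counts[f"sentence_{i}"] = vec

# Fix the row order so keys can be matched to matrix rows after loading.
keys = list(counts)

# Stack one sparse row per entry; no full dense matrix is ever built.
matrix = sparse.vstack(
    [sparse.csr_matrix(counts[k]) for k in keys], format="csr"
)

# Save the sparse matrix (save_npz compresses by default).
sparse_path = os.path.join(tempfile.mkdtemp(), "counts_sparse.npz")
sparse.save_npz(sparse_path, matrix)

dense_bytes = sum(v.nbytes for v in counts.values())  # in-memory dense size
sparse_bytes = os.path.getsize(sparse_path)           # on-disk sparse size
print(f"dense arrays: {dense_bytes} bytes, sparse file: {sparse_bytes} bytes")

# Round trip: the loaded matrix should equal the one that was saved.
restored = sparse.load_npz(sparse_path)
roundtrip_ok = (restored != matrix).nnz == 0
```

The list `keys` has to be stored separately (for example as JSON) if the mapping from dictionary keys to rows is needed later. Note that pickling a sparse matrix that was first converted back to a dense array, or a dictionary of one-row sparse matrices, may not shrink the file at all.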

As the links above discuss, you could use klepto, which lets you easily store dictionaries to disk or to a database using a common API. klepto also lets you pick a storage format (pickle, json, etc.); HDF5 support is coming soon. It can use both specialized pickle formats (like numpy's) and compression (if you care about size more than speed).

klepto gives you the option to store the dictionary as an "all-in-one" file or as "one-entry-per" file, and it can also leverage multiprocessing or multithreading, meaning that you can save and load dictionary items to/from the backend in parallel.
