How to write large JSON data?

2024/5/20 12:37:34

I have been trying to write a large amount (>800 MB) of data to a JSON file; after a fair amount of trial and error I ended up with this code:

    import json

    def write_to_cube(data):
        # load the whole existing file, merge in the new data...
        with open('test.json') as file1:
            temp_data = json.load(file1)
        temp_data.update(data)
        file1.close()
        # ...then rewrite the entire file from scratch
        with open('test.json', 'w') as f:
            json.dump(temp_data, f)
        f.close()

To run it, just call the function: write_to_cube({"some_data": data})

Now, this code is fast for small amounts of data, but the problem comes once the test.json file holds more than 800 MB: updating or adding data then takes ages.

I know there are external libraries such as simplejson or jsonpickle, but I am not sure how to use them.

Is there any other way to approach this problem?

Update:

I am not sure how this can be a duplicate; the other questions say nothing about writing or updating a large JSON file, only about parsing:

Is there a memory efficient and fast way to load big json files in python?

Reading rather large json files in Python

Neither of the above makes this question a duplicate; they say nothing about writing or updating.

Answer

I found the json-stream package, which might be able to help. While it does provide the mechanics for stepping over the input JSON and for streaming Python data structures to an output JSON file, without concrete details from OP it's hard to say whether this would have met their needs.
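Before the benchmark, a minimal sketch of the two json-stream entry points the rest of this answer leans on (the file names here are placeholders, and I'm assuming the top-level streamable_dict import; check the package docs if it lives elsewhere):

    import json
    import json_stream
    from json_stream import streamable_dict

    # Reading: json_stream.load() gives a lazy, transient view of the
    # document; iterating it parses the file incrementally rather than
    # building the whole structure in memory.
    with open("in.json") as f:
        for key, value in json_stream.load(f).items():
            pass  # each value is only usable while the parser is on it

    # Writing: a generator wrapped in @streamable_dict is accepted by the
    # standard json.dump(), which pulls key/value pairs one at a time.
    @streamable_dict
    def pairs():
        for i in range(3):
            yield str(i), {"foo": "bar"}

    with open("out.json", "w") as f:
        json.dump(pairs(), f, indent=1)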

Just to see if it actually has any memory advantage in processing large files, I've mocked up this basic JSON:

{"0": {"foo": "bar"},"1": {"foo": "bar"},"2": {"foo": "bar"},"3": {"foo": "bar"},...

up to 10M objects:

    ..."9999997": {"foo": "bar"},"9999998": {"foo": "bar"},"9999999": {"foo": "bar"},
}

and I've made up the requirement to change every odd object to {"foo": "BAR"}:

{"0": {"foo": "bar"},"1": {"foo": "BAR"},"2": {"foo": "bar"},"3": {"foo": "BAR"},..."9999997": {"foo": "BAR"},"9999998": {"foo": "bar"},"9999999": {"foo": "BAR"},
}

I'm certain this is more trivial than whatever OP needed to do when passing in an update dict (which I imagine has a moderately "deep" structure).

I've written scripts to handle the generation, reading, and transforming of some test articles:

  • generate:

    @streamable_dict
    def yield_obj(n: int):
        for x in range(n):
            yield str(x), {"foo": "bar"}

    def gen_standard(n: int):
        with open(f"gen/{n}.json", "w") as f:
            obj = dict(list(yield_obj(n)))
            json.dump(obj, f, indent=1)

    def gen_stream(n: int):
        with open(f"gen/{n}.json", "w") as f:
            json.dump(yield_obj(n), f, indent=1)
    

    yield_obj() is a generator whose output can either be materialized with dict(list(...)) or streamed to the standard json.dump() method with the help of the @streamable_dict wrapper.

    Makes three test files:

    -rw-r--r--   1 zyoung  staff   2.9M Feb 23 17:24 100000.json
    -rw-r--r--   1 zyoung  staff    30M Feb 23 17:24 1000000.json
    -rw-r--r--   1 zyoung  staff   314M Feb 23 17:24 10000000.json
    
  • read, which just loads and passes over everything:

    def read_standard(fname: str):
        with open(fname) as f:
            for _ in json.load(f):
                pass

    def read_stream(fname: str):
        with open(fname) as f:
            for _ in json_stream.load(f):
                pass
    
  • transform, which applies my silly "uppercase every odd BAR":

    def transform_standard(fname: str):
        with open(fname) as f_in:
            data = json.load(f_in)
        for key, value in data.items():
            if int(key) % 2 == 1:
                value["foo"] = "BAR"
        with open(out_name(fname), "w") as f_out:
            json.dump(data, f_out, indent=1)

    def transform_stream(fname: str):
        @streamable_dict
        def update(data):
            for key, value in data.items():
                value = json_stream.to_standard_types(value)
                if int(key) % 2 == 1:
                    value["foo"] = "BAR"
                yield key, value

        with open(fname) as f_in:
            data = json_stream.load(f_in)
            updated_data = update(data)
            with open(out_name(fname), "w") as f_out:
                json.dump(updated_data, f_out, indent=1)
    

    @streamable_dict is used again to turn the update() iterator into a streamable "thing" that can be passed to the standard json.dump() method.

The complete code and the runners are in this Gist.
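To tie this back to the question: here is a hedged sketch (mine, not from the Gist; write_to_cube_streamed and the file names are made up for illustration) of how OP's read-merge-rewrite could be restructured so the large document is never fully in memory. The trade-off is writing to a new file instead of updating test.json in place:

    import json
    import json_stream
    from json_stream import streamable_dict

    def write_to_cube_streamed(update, src="test.json", dst="test.new.json"):
        # Assumes `update` itself fits in memory; it's only the big
        # on-disk document that is streamed.
        @streamable_dict
        def merged(data):
            pending = dict(update)
            for key, value in data.items():
                if key in pending:
                    yield key, pending.pop(key)  # overwritten, like dict.update()
                else:
                    yield key, json_stream.to_standard_types(value)
            yield from pending.items()  # keys that didn't exist before

        with open(src) as f_in:
            with open(dst, "w") as f_out:
                json.dump(merged(json_stream.load(f_in)), f_out)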

The stats show that json-stream has a flat memory curve when processing 100_000, 1_000_000, and 10_000_000 objects. It does take more time to read and transform, though:

Generate

Method    Items  Real (s)  User (s)  Sys (s)  Mem (MB)
standard  1e+05      0.19      0.17     0.01     45.84
standard  1e+06      2.00      1.93     0.06    372.97
standard  1e+07     21.67     20.46     1.03   3480.29
stream    1e+05      0.18      0.15     0.00      7.28
stream    1e+06      1.43      1.41     0.02      7.69
stream    1e+07     14.41     14.07     0.20      7.58

Read

Method    Items  Real (s)  User (s)  Sys (s)  Mem (MB)
standard  1e+05      0.05      0.04     0.01     48.28
standard  1e+06      0.58      0.50     0.05    390.17
standard  1e+07      7.69      6.73     0.80   3875.81
stream    1e+05      0.32      0.31     0.01      7.70
stream    1e+06      2.96      2.94     0.02      7.69
stream    1e+07     29.88     29.65     0.17      7.77

Transform

Method    Items  Real (s)  User (s)  Sys (s)  Mem (MB)
standard  1e+05      0.19      0.17     0.01     48.05
standard  1e+06      1.83      1.75     0.07    388.83
standard  1e+07     20.16     19.15     0.91   3875.49
stream    1e+05      0.63      0.61     0.01      7.61
stream    1e+06      6.06      6.02     0.03      7.92
stream    1e+07     61.44     60.89     0.35      8.44
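The runners in the Gist are the authoritative source for these numbers; as one in-process approximation (this measure() helper is my own sketch, not from the Gist, and is Unix-only), Python's resource module exposes the same user/sys/peak-memory figures:

    import resource
    import sys
    import time

    def measure(fn, *args):
        start = time.perf_counter()
        fn(*args)
        real = time.perf_counter() - start
        usage = resource.getrusage(resource.RUSAGE_SELF)
        # ru_maxrss is reported in bytes on macOS but kilobytes on Linux
        mem_mb = usage.ru_maxrss / (1024 * 1024 if sys.platform == "darwin" else 1024)
        print(f"real {real:.2f}s  user {usage.ru_utime:.2f}s  "
              f"sys {usage.ru_stime:.2f}s  mem {mem_mb:.2f} MB")

    # e.g., with the read function defined above:
    measure(read_stream, "gen/10000000.json")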