Python - Convert Very Large (6.4GB) XML files to JSON

2024/10/4 1:19:28

Essentially, I have a 6.4GB XML file that I'd like to convert to JSON then save it to disk. I'm currently running OSX 10.8.4 with an i7 2700k and 16GBs of ram, and running Python 64bit (double checked). I'm getting an error that I don't have enough memory to allocate. How do I go about fixing this?

print 'Opening'
f = open('large.xml', 'r')
data = f.read()
f.close()print 'Converting'
newJSON = xmltodict.parse(data)print 'Json Dumping'
newJSON = json.dumps(newJSON)print 'Saving'
f = open('newjson.json', 'w')
f.write(newJSON)
f.close()

The Error:

Python(2461) malloc: *** mmap(size=140402048315392) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):File "/Users/user/Git/Resources/largexml2json.py", line 10, in <module>data = f.read()
MemoryError
Answer

Many Python XML libraries support parsing XML sub elements incrementally, e.g. xml.etree.ElementTree.iterparse and xml.sax.parse in the standard library. These functions are usually called "XML Stream Parser".

The xmltodict library you used also has a streaming mode. I think it may solve your problem

https://github.com/martinblech/xmltodict#streaming-mode

https://en.xdnf.cn/q/70665.html

Related Q&A

Python create tree from a JSON file

Lets say that we have the following JSON file. For the sake of the example its emulated by a string. The string is the input and a Tree object should be the output. Ill be using the graphical notation …

disable `functools.lru_cache` from inside function

I want to have a function that can use functools.lru_cache, but not by default. I am looking for a way to use a function parameter that can be used to disable the lru_cache. Currently, I have a two ver…

How to clear tf.flags?

If I run this code twice:tf.flags.DEFINE_integer("batch_size", "2", "batch size for training")I will get this error:DuplicateFlagError: The flag batch_size is defined twic…

Stochastic Optimization in Python

I am trying to combine cvxopt (an optimization solver) and PyMC (a sampler) to solve convex stochastic optimization problems. For reference, installing both packages with pip is straightforward: pip in…

Pandas convert yearly to monthly

Im working on pulling financial data, in which some is formatted in yearly and other is monthly. My model will need all of it monthly, therefore I need that same yearly value repeated for each month. …

Firebase database data to R

I have a database in Google Firebase that has streaming sensor data. I have a Shiny app that needs to read this data and map the sensors and their values.I am trying to pull the data from Firebase into…

Django 1.8 Migrations - NoneType object has no attribute _meta

Attempting to migrate a project from Django 1.7 to 1.8. After wrestling with code errors, Im able to get migrations to run. However, when I try to migrate, Im given the error "NoneType object has …

Manage dependencies of git submodules with poetry

We have a repository app-lib that is used as sub-module in 4 other repos and in each I have to add all dependencies for the sub-module. So if I add/remove a dependency in app-lib I have to adjust all o…

Create Boxplot Grouped By Column

I have a Pandas DataFrame, df, that has a price column and a year column. I want to create a boxplot after grouping the rows based on their year. Heres an example: import pandas as pd temp = pd.DataF…

How can I configure gunicorn to use a consistent error log format?

I am using Gunicorn in front of a Python Flask app. I am able to configure the access log format using the --access-log-format command line parameter when I run gunicorn. But I cant figure out how to c…