How can I process XML asynchronously in Python?

2024/10/1 3:24:07

I have a large XML data file (>160M) to process, and it seems like SAX/expat/pulldom parsing is the way to go. I'd like to have a thread that sifts through the nodes and pushes nodes to be processed onto a queue, and then other worker threads pull the next available node off the queue and process it.

I have the following (it should have locks, I know - it will, later)

import sys, time
import xml.parsers.expat
import threading

q = []

def start_handler(name, attrs):
    q.append(name)

def do_expat():
    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start_handler
    p.buffer_text = True
    print("opening {0}".format(sys.argv[1]))
    with open(sys.argv[1]) as f:
        print("file is open")
        p.ParseFile(f)
        print("parsing complete")

t = threading.Thread(group=None, target=do_expat)
t.start()

while True:
    print(q)
    time.sleep(1)

The problem is that the body of the while block gets called only once, and then I can't even ctrl-C interrupt it. On smaller files, the output is as expected, but that seems to indicate that the handler only gets called when the document is fully parsed, which seems to defeat the purpose of a SAX parser.

I'm sure it's my own ignorance, but I don't see where I'm making the mistake.

PS: I also tried changing start_handler thus:

def start_handler(name, attrs):
    def app():
        q.append(name)
    u = threading.Thread(group=None, target=app)
    u.start()

No love, though.

Answer

ParseFile, as you've noticed, just "gulps down" everything -- no good for the incremental parsing you want to do! So, just feed the file to the parser a bit at a time, making sure to conditionally yield control to other threads as you go -- e.g.:

while True:
    data = f.read(BUFSIZE)
    if not data:
        p.Parse('', True)
        break
    p.Parse(data, False)
    time.sleep(0.0)

The time.sleep(0.0) call is Python's way to say "yield to other threads if any are ready and waiting"; the Parse method is documented in the xml.parsers.expat section of the standard library docs.
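As a rough, self-contained sketch of the chunked feeding loop above (not the asker's full program): the function name parse_incrementally, the BUFSIZE value, and the in-memory sample document are all assumptions made for illustration. It assumes Python 3 and a file object opened in binary mode.

```python
import io
import time
import xml.parsers.expat

BUFSIZE = 64 * 1024  # assumed chunk size; tune for your data


def parse_incrementally(f, start_handler, bufsize=BUFSIZE):
    """Feed a binary file object to expat one chunk at a time."""
    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = start_handler
    while True:
        data = f.read(bufsize)
        if not data:
            p.Parse(b"", True)   # tell the parser the input is finished
            break
        p.Parse(data, False)     # handlers are called as data is consumed
        time.sleep(0.0)          # give other threads a chance to run


# Tiny in-memory document standing in for the large file (hypothetical data).
seen = []
sample = io.BytesIO(b"<root><a/><b><c/></b></root>")
parse_incrementally(sample, lambda name, attrs: seen.append(name), bufsize=8)
print(seen)
```

The deliberately small bufsize in the demo only shows that tags split across chunk boundaries are still handled; for a real file a much larger chunk would be more sensible.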

The second point is, forget locks for this usage! -- use Queue.Queue instead (queue.Queue in Python 3), it's intrinsically threadsafe and almost invariably the best and simplest way to coordinate multiple threads in Python. Just make a Queue instance q, call q.put(name) on it from the handler, and have worker threads block on q.get() waiting to get some more work to do -- it's SO simple!
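A minimal sketch of that producer/worker arrangement, assuming Python 3 (where the module is named queue). The worker count, the sample names, and the upper-casing "processing" step are placeholders, not part of the original question.

```python
import queue
import threading

q = queue.Queue()      # work items pushed by the parser's start handler
done = queue.Queue()   # processed results, also thread-safe


def worker():
    while True:
        name = q.get()           # blocks until an item is available
        done.put(name.upper())   # placeholder for the real processing
        q.task_done()            # mark this item as finished


# Daemon threads exit automatically when the main thread does.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# In the real program the expat start handler would call q.put(name).
for name in ["root", "a", "b"]:
    q.put(name)

q.join()  # wait until every queued item has been processed
processed = sorted(done.get() for _ in range(3))
print(processed)
```

Using a second queue for the results keeps the sketch lock-free, in the spirit of the advice above.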

(There are several auxiliary strategies you can use to coordinate the termination of worker threads when there's no more work for them to do, but the simplest, absent special requirements, is to just make them daemon threads, so they will all terminate when the main thread does -- see the threading module docs).
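The answer does not say which other strategies it has in mind. One common pattern, shown here only as an assumed example, is to push one sentinel object per worker so that each worker exits its loop cleanly. The STOP name and the sample items are hypothetical.

```python
import queue
import threading

STOP = object()          # sentinel meaning "no more work" (hypothetical name)
q = queue.Queue()
handled = queue.Queue()  # collects what the workers processed


def worker():
    while True:
        item = q.get()
        if item is STOP:   # each worker consumes exactly one sentinel
            break
        handled.put(item)  # placeholder for the real processing


workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for name in ["a", "b", "c", "d"]:
    q.put(name)

# After the parser finishes, queue one sentinel per worker, then wait.
for _ in workers:
    q.put(STOP)
for w in workers:
    w.join()

print(handled.qsize())
```

Unlike daemon threads, this lets every worker finish the item it is handling before the program exits.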
