How to split large Wikipedia dump .xml.bz2 files in Python?

2024/10/4 9:23:06

I am trying to build an offline Wiktionary from the Wikimedia dump files (.xml.bz2) using Python. I started with this article as a guide. It involves a number of languages, and I want to combine all the steps into a single Python project. I have found almost all the libraries required for the process. The only hurdle now is to efficiently split the large .xml.bz2 file into a number of smaller files for quicker parsing during search operations.

I know that the bz2 library exists in Python, but it only provides compress and decompress operations. I need something that does what bz2recover does from the command line, which splits a large file into a number of smaller chunks.

One more important point: the splitting must not break up page contents, which start with <page> and end with </page> in the compressed XML document.

Is there an existing library that can handle this, or does the code have to be written from scratch? (Any outline or pseudo-code would be greatly helpful.)

Note: I would like to make the resulting package cross-platform, so I can't use OS-specific commands.

Answer

In the end I wrote a Python script myself:

import os
import bz2


def split_xml(filename):
    '''Take the filename of a wiktionary .xml.bz2 file as input and
    create smaller chunks of it in the directory "chunks".'''
    # Check for and create the chunk directory
    if not os.path.exists("chunks"):
        os.mkdir("chunks")

    # Counters
    pagecount = 0
    filecount = 1

    # Open the first chunk file in write mode
    chunkname = lambda filecount: os.path.join("chunks", "chunk-" + str(filecount) + ".xml.bz2")
    chunkfile = bz2.BZ2File(chunkname(filecount), 'w')

    # Read line by line (BZ2File yields bytes in Python 3, hence b'</page>')
    bzfile = bz2.BZ2File(filename)
    for line in bzfile:
        chunkfile.write(line)
        # A </page> line marks the end of a wiki page
        if b'</page>' in line:
            pagecount += 1
        if pagecount > 1999:
            # print(chunkname(filecount))  # For debugging
            chunkfile.close()
            pagecount = 0   # reset pagecount
            filecount += 1  # move on to the next chunk file
            chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    bzfile.close()
    chunkfile.close()


if __name__ == '__main__':
    # When the script is run directly
    split_xml('wiki-files/tawiktionary-20110518-pages-articles.xml.bz2')
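Because each chunk is only closed right after a </page> line, every chunk holds whole pages, but a chunk on its own is not a well-formed XML document (only the first one carries the dump header, and only the last one the closing root tag). A minimal sketch of reading pages back out of one chunk for a search, assuming the dump puts <page> and </page> on their own lines; the names iter_pages and find_pages are hypothetical helpers, not part of the script above:

```python
import bz2


def iter_pages(chunk_path):
    """Yield each complete <page>...</page> block of a chunk as a str.

    Assumes <page> and </page> each sit on their own line. Lines outside
    a page (dump header, closing root tag) are skipped.
    """
    buffer = []
    inside_page = False
    # bz2.open in text mode decompresses lazily, line by line
    with bz2.open(chunk_path, "rt", encoding="utf-8") as chunk:
        for line in chunk:
            if "<page>" in line:
                inside_page = True
                buffer = []
            if inside_page:
                buffer.append(line)
            if inside_page and "</page>" in line:
                inside_page = False
                yield "".join(buffer)


def find_pages(chunk_path, needle):
    """Return every page in the chunk whose raw text contains needle."""
    return [page for page in iter_pages(chunk_path) if needle in page]
```

Each yielded string is a self-contained <page> element, so it could be handed to an XML parser one page at a time instead of parsing the whole chunk at once.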
