My dataset is composed of arXiv astrophysics articles as .tex files, and I need to extract only text from the article body, not from any other part of the article (e.g. tables, figures, abstract, title, footnotes, acknowledgements, citations, etc.).
I've been trying with Python 3 and tex2py, but I'm struggling to get a clean corpus, because the files differ in labeling and the text is broken up between labels.
I have attached an SSCCE, a couple of sample LaTeX files and their PDFs, and the parsed corpus. The corpus shows my struggles: sections and subsections are not extracted in order, text breaks at some labels, and some tables and figures are included.
Code:
    import os
    from tex2py import tex2py

    corpus = open('corpus2.tex', 'a')


    def parseFiles():
        r"""
        Parses downloaded document .tex files for word content.
        We are only interested in the article body, defined by \section tags.
        """
        for file in os.listdir("latex"):
            if file.endswith('.tex'):
                print('\nChecking ' + file + '...')
                with open("latex/" + file) as f:
                    try:
                        toc = tex2py(f)  # toc = tree of contents
                        # If file is a document, defined as having \begin{document}
                        if toc.source.document:
                            # Iterate over each section in document
                            for section in toc:
                                # Parse the section
                                getText(section)
                        else:
                            print(file + ' is not a document. Discarded.')
                    except (EOFError, TypeError, UnicodeDecodeError):
                        print('Error: ' + file + ' was not correctly formatted. Discarded.')


    def getText(section):
        """
        Extracts text from given "section" node and any nested "subsection" nodes.

        Parameters
        ----------
        section : list
            A "section" node in a .tex document
        """
        # For each element within the section
        for x in section:
            if hasattr(x.source, 'name'):
                # If it is a subsection or subsubsection, parse it
                if x.source.name == 'subsection' or x.source.name == 'subsubsection':
                    corpus.write('\nSUBSECTION!!!!!!!!!!!!!\n')
                    getText(x)
                # Avoid parsing past these sections
                elif x.source.name == 'acknowledgements' or x.source.name == 'appendix':
                    return
            # If element is text, add it to corpus
            elif isinstance(x.source, str):
                # If element is inline math, worry about it later
                if x.source.startswith('$') and x.source.endswith('$'):
                    continue
                corpus.write(str(x))
            # If element is 'RArg' labelled, e.g. \em for italic, add it to corpus
            elif type(x.source).__name__ == 'RArg':
                corpus.write(str(x.source))


    if __name__ == '__main__':
        # Runs if script is called on the command line
        parseFiles()
        corpus.close()
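To make the goal concrete, here is a rough sketch of the output I am after, written with plain regular expressions from the standard library instead of tex2py. Everything in it is my own assumption rather than a known-good solution: the function name `extract_body`, the list of environments to skip, and the markers that end the body. It only handles one level of braces, so nested commands and custom macros will slip through.

```python
import re

# Environments whose contents are dropped entirely (an assumed, incomplete list).
SKIP_ENVS = ("table", "table*", "figure", "figure*", "abstract",
             "equation", "equation*", "align", "align*")

# Markers assumed to end the article body.
END_MARKERS = re.compile(
    r"\\(?:section\*?\{\s*acknowledge?ments?\s*\}"
    r"|begin\{acknowledge?ments?\}"
    r"|acknowledge?ments?\b"
    r"|appendix\b"
    r"|bibliography"
    r"|begin\{thebibliography\}"
    r"|end\{document\})",
    re.IGNORECASE,
)


def extract_body(tex: str) -> str:
    r"""Return rough plain text from the first \section up to the end markers."""
    # Strip comments: an unescaped % up to the end of the line.
    tex = re.sub(r"(?<!\\)%.*", "", tex)

    # Keep only what follows the first \section (drops title, abstract, preamble).
    start = re.search(r"\\section\*?\{", tex)
    if start is None:
        return ""
    tex = tex[start.start():]

    # Cut at acknowledgements, appendix, bibliography, or the end of the document.
    end = END_MARKERS.search(tex)
    if end is not None:
        tex = tex[:end.start()]

    # Drop whole environments such as tables and figures.
    for env in SKIP_ENVS:
        name = re.escape(env)
        tex = re.sub(r"\\begin\{%s\}.*?\\end\{%s\}" % (name, name), "",
                     tex, flags=re.DOTALL)

    # Drop footnotes, citations, references and labels together with their arguments.
    tex = re.sub(
        r"\\(?:footnote|cite[a-z]*\*?|ref|eqref|label)(?:\[[^\]]*\])*\{[^{}]*\}",
        "", tex)

    # Drop inline math.
    tex = re.sub(r"\$[^$]*\$", "", tex)

    # Turn headings into plain lines: \subsection{Data} -> Data
    tex = re.sub(r"\\(?:sub)*section\*?\{([^{}]*)\}", r"\n\1\n", tex)

    # Unwrap simple formatting commands: \emph{x} -> x
    tex = re.sub(r"\\[a-zA-Z]+\*?\{([^{}]*)\}", r"\1", tex)

    # Remove any remaining bare commands and stray braces.
    tex = re.sub(r"\\[a-zA-Z]+\*?", "", tex)
    tex = tex.replace("{", "").replace("}", "")

    # Collapse blank lines and runs of spaces.
    tex = re.sub(r"\n\s*\n+", "\n\n", tex)
    return re.sub(r"[ \t]+", " ", tex).strip()
```

Calling `extract_body(open("latex/some_file.tex").read())` (the path is a placeholder) should return the section text in document order, which is the ordering my tex2py version fails to preserve.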
Links to the rest:
- Sample .tex file 1 and its pdf
- Sample .tex file 2 and its pdf
- Resulting corpus
I'm aware of a related question (Programatically converting/parsing latex code to plain text), but it does not seem to have a conclusive answer.