Question 1

My dataset is composed of arXiv astrophysics articles as .tex files, and I need to extract only text from the article body, not from any other part of the article (e.g. tables, figures, abstract, title, footnotes, acknowledgements, citations, etc.).

I've been trying with Python 3 and tex2py, but I'm struggling to get a clean corpus, because the files differ in labeling and the text is broken up between labels.

I have attached an SSCCE, a couple of sample LaTeX files and their PDFs, and the parsed corpus. The corpus shows my struggles: sections and subsections are not extracted in order, text breaks at some labels, and some tables and figures are included.

Code:

    import os
    from tex2py import tex2py

    corpus = open('corpus2.tex', 'a')


    def parseFiles():
        """
        Parses downloaded document .tex files for word content.
        We are only interested in the article body, defined by /section tags.
        """
        for file in os.listdir("latex"):
            if file.endswith('.tex'):
                print('\nChecking ' + file + '...')
                with open("latex/" + file) as f:
                    try:
                        toc = tex2py(f)  # toc = tree of contents
                        # If file is a document, defined as having \begin{document}
                        if toc.source.document:
                            # Iterate over each section in document
                            for section in toc:
                                # Parse the section
                                getText(section)
                        else:
                            print(file + ' is not a document. Discarded.')
                    except (EOFError, TypeError, UnicodeDecodeError):
                        print('Error: ' + file + ' was not correctly formatted. Discarded.')


    def getText(section):
        """
        Extracts text from given "section" node and any nested "subsection" nodes.

        Parameters
        ----------
        section : list
            A "section" node in a .tex document
        """
        # For each element within the section
        for x in section:
            if hasattr(x.source, 'name'):
                # If it is a subsection or subsubsection, parse it
                if x.source.name == 'subsection' or x.source.name == 'subsubsection':
                    corpus.write('\nSUBSECTION!!!!!!!!!!!!!\n')
                    getText(x)
                # Avoid parsing past these sections
                elif x.source.name == 'acknowledgements' or x.source.name == 'appendix':
                    return
            # If element is text, add it to corpus
            elif isinstance(x.source, str):
                # If element is inline math, worry about it later
                if x.source.startswith('$') and x.source.endswith('$'):
                    continue
                corpus.write(str(x))
            # If element is 'RArg' labelled, e.g. \em for italic, add it to corpus
            # (compare names with ==, not "is", which tests object identity)
            elif type(x.source).__name__ == 'RArg':
                corpus.write(str(x.source))


    if __name__ == '__main__':
        """Runs if script called on command line"""
        parseFiles()

Links to the rest:

Sample .tex file 1 and its pdf
Sample .tex file 2 and its pdf
Resulting corpus

I'm aware of a related question (Programatically converting/parsing latex code to plain text), but it does not seem to have a conclusive answer.

Answer

To grab all text from a document, tree.descendants is a lot friendlier here. Iterating over it outputs all text in order.

    def getText(section):
        for token in section.descendants:
            # Keep only plain-text tokens; nested nodes are visited by the iterator
            if isinstance(token, str):
                corpus.write(str(token))
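To see why walking descendants keeps the text in order, here is a minimal, library-free sketch. The `Node` class and `collect_text` helper are hypothetical stand-ins made up for illustration, not part of TexSoup or tex2py; the sketch only assumes that descendants are yielded depth-first, in source order.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Node:
    """Hypothetical stand-in for a parsed LaTeX node (not a TexSoup class)."""
    name: str
    children: List[Union["Node", str]] = field(default_factory=list)

    @property
    def descendants(self) -> Iterator[Union["Node", str]]:
        """Yield every nested node and text token, depth-first, in source order."""
        for child in self.children:
            yield child
            if isinstance(child, Node):
                yield from child.descendants


def collect_text(root: Node) -> List[str]:
    """Keep only the plain-text tokens, preserving document order."""
    return [token for token in root.descendants if isinstance(token, str)]


# A toy document: one section with a nested subsection, then a second section.
doc = Node("document", [
    Node("section", [
        "Intro text. ",
        Node("subsection", ["Nested text. "]),
        "More intro. ",
    ]),
    Node("section", ["Second section."]),
])

print("".join(collect_text(doc)))
```

Because the subsection is visited at the point where it appears inside its parent, its text lands between the surrounding paragraphs rather than before or after them.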

To capture the edge cases, I wrote a slightly more fleshed-out version. It includes checks for all the conditions you've listed above.

    from TexSoup import RArg

    def getText(section):
        for x in section.descendants:
            if isinstance(x, str):
                if x.startswith('$') and x.endswith('$'):
                    continue
                corpus.write(str(x))
            elif isinstance(x, RArg):
                corpus.write(str(x))
            elif hasattr(x, 'source') and hasattr(x.source, 'name') and x.source.name in ('acknowledgements', 'appendix'):
                return
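The function above does not by itself drop tables or figures. One possible complement, sketched here with the standard library only, is to strip those environments from the raw source and cut it off at the back matter before handing it to the parser. This is a different technique from the TexSoup traversal, and everything in it is an assumption: the helper names `strip_environments` and `truncate_at_back_matter`, the environment list, and the back-matter markers are illustrative, and the regular expressions do not handle nested environments of the same name.

```python
import re

# Environments to drop entirely; this list is an assumption, extend as needed.
SKIP_ENVS = ("table", "table*", "figure", "figure*", "abstract")

# Markers that are assumed to start the back matter in these articles.
BACK_MATTER = re.compile(
    r"\\(?:acknowledge?ments\b|appendix\b|begin\{acknowledge?ments\})"
)


def strip_environments(tex: str, envs=SKIP_ENVS) -> str:
    """Remove each \\begin{env} ... \\end{env} block (non-nested) from the source."""
    for env in envs:
        name = re.escape(env)
        pattern = r"\\begin\{%s\}.*?\\end\{%s\}" % (name, name)
        tex = re.sub(pattern, "", tex, flags=re.DOTALL)
    return tex


def truncate_at_back_matter(tex: str) -> str:
    """Cut the source at the first acknowledgements or appendix marker, if any."""
    match = BACK_MATTER.search(tex)
    return tex[:match.start()] if match else tex


sample = r"""
\begin{document}
\section{Intro}
Body text.
\begin{table}
a & b \\
\end{table}
More body.
\acknowledgements
Thanks.
\end{document}
"""

cleaned = truncate_at_back_matter(strip_environments(sample))
print(cleaned)
```

The cleaned string can then be passed to the parser in place of the raw file contents. Note that truncating also removes the closing \end{document}, so a parser that requires a balanced document may need it appended again.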

Extract only body text from arXiv articles formatted as .tex
