Extract only body text from arXiv articles formatted as .tex

2024/6/30 15:55:33

My dataset is composed of arXiv astrophysics articles as .tex files, and I need to extract only text from the article body, not from any other part of the article (e.g. tables, figures, abstract, title, footnotes, acknowledgements, citations, etc.).

I've been trying with Python 3 and tex2py, but I'm struggling to get a clean corpus because the files differ in labeling and the text is broken up between labels.

I have attached an SSCCE, a couple of sample LaTeX files and their PDFs, and the parsed corpus. The corpus shows my struggles: sections and subsections are not extracted in order, text breaks at some labels, and some tables and figures are included.

Code:

import os
from tex2py import tex2py

corpus = open('corpus2.tex', 'a')

def parseFiles():
    """
    Parses downloaded document .tex files for word content.
    We are only interested in the article body, defined by /section tags.
    """
    for file in os.listdir("latex"):
        if file.endswith('.tex'):
            print('\nChecking ' + file + '...')
            with open("latex/" + file) as f:
                try:
                    toc = tex2py(f)  # toc = tree of contents
                    # If file is a document, defined as having \begin{document}
                    if toc.source.document:
                        # Iterate over each section in document
                        for section in toc:
                            # Parse the section
                            getText(section)
                    else:
                        print(file + ' is not a document. Discarded.')
                except (EOFError, TypeError, UnicodeDecodeError):
                    print('Error: ' + file + ' was not correctly formatted. Discarded.')

def getText(section):
    """
    Extracts text from given "section" node and any nested "subsection" nodes.

    Parameters
    ----------
    section : list
        A "section" node in a .tex document
    """
    # For each element within the section
    for x in section:
        if hasattr(x.source, 'name'):
            # If it is a subsection or subsubsection, parse it
            if x.source.name == 'subsection' or x.source.name == 'subsubsection':
                corpus.write('\nSUBSECTION!!!!!!!!!!!!!\n')
                getText(x)
            # Avoid parsing past these sections
            elif x.source.name == 'acknowledgements' or x.source.name == 'appendix':
                return
        # If element is text, add it to corpus
        elif isinstance(x.source, str):
            # If element is inline math, worry about it later
            if x.source.startswith('$') and x.source.endswith('$'):
                continue
            corpus.write(str(x))
        # If element is 'RArg' labelled, e.g. \em for italic, add it to corpus
        # (compare names with ==, not "is", which tests object identity)
        elif type(x.source).__name__ == 'RArg':
            corpus.write(str(x.source))

if __name__ == '__main__':
    """Runs if script called on command line"""
    parseFiles()

Links to the rest:

  • Sample .tex file 1 and its pdf
  • Sample .tex file 2 and its pdf
  • Resulting corpus

I'm aware of a related question (Programatically converting/parsing latex code to plain text), but it does not seem to have a conclusive answer.

Answer

To grab all text from a document, tree.descendants will be a lot more friendly here. This will output all text in order.

def getText(section):
    for token in section.descendants:
        if isinstance(token, str):
            corpus.write(str(token))
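To illustrate why walking descendants keeps the text in order, here is a toy sketch. The Node class below is a hypothetical stand-in written for this example, not the tex2py or TexSoup API; it only assumes that descendants is a depth-first, left-to-right walk over nested nodes and strings.

```python
class Node:
    # Hypothetical stand-in for a parsed LaTeX node; NOT the tex2py/TexSoup API.
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)  # mix of Node objects and plain strings

    @property
    def descendants(self):
        # Depth-first, left-to-right: yields every nested item in document order.
        for child in self.children:
            yield child
            if isinstance(child, Node):
                yield from child.descendants


doc = Node("document", [
    Node("section", [
        "Intro text. ",
        Node("subsection", ["Nested text. "]),
        "More intro. ",
    ]),
    Node("section", ["Conclusion."]),
])

# Keep only the string leaves, in the order they are encountered.
text = "".join(t for t in doc.descendants if isinstance(t, str))
print(text)
```

Because nested nodes are visited at the point where they occur, the subsection text lands between the two pieces of its parent section rather than before or after them.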

To capture the edge cases, I wrote a slightly more fleshed-out version. This includes checks for all the conditions you've listed up there.

from TexSoup import RArg

def getText(section):
    for x in section.descendants:
        if isinstance(x, str):
            if x.startswith('$') and x.endswith('$'):
                continue
            corpus.write(str(x))
        elif isinstance(x, RArg):
            corpus.write(str(x))
        elif hasattr(x, 'source') and hasattr(x.source, 'name') and x.source.name in ('acknowledgements', 'appendix'):
            return
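The question also mentions tables, figures and the abstract leaking into the corpus. One possible complement, sketched here with plain regular expressions rather than tex2py, is to strip those environments from the raw source before it is parsed. The helper names (strip_environments, body_only) and the SKIP_ENVS list are assumptions made for this example, and the pattern does not handle an environment nested inside another of the same name.

```python
import re

# Environments to drop before parsing. This list is an assumption; extend as needed.
SKIP_ENVS = ("abstract", "table", "table*", "figure", "figure*", "thebibliography")


def strip_environments(tex, envs=SKIP_ENVS):
    # Remove each begin/end block for the listed environments (non-nested only).
    for env in envs:
        name = re.escape(env)
        pattern = re.compile(
            r"\\begin\{" + name + r"\}.*?\\end\{" + name + r"\}", re.DOTALL
        )
        tex = pattern.sub("", tex)
    return tex


def body_only(tex):
    # Keep only what lies between the document begin and end markers.
    match = re.search(r"\\begin\{document\}(.*?)\\end\{document\}", tex, re.DOTALL)
    return match.group(1) if match else ""


sample = r"""
\documentclass{article}
\begin{document}
\begin{abstract}We study stars.\end{abstract}
\section{Intro}
Stars are bright.
\begin{table}a & b\end{table}
They are also hot.
\end{document}
"""

cleaned = strip_environments(body_only(sample))
print(cleaned)
```

The section markup is left in place on purpose, so a tree-based walk like the one above still has section nodes to work with.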