Question 1

I'm working with the NYT corpus in Python and trying to extract only the text inside the "full_text" block of each .xml article file. For example:

<body.content><block class="lead_paragraph"><p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p></block><block class="full_text"><p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p></block>

Ideally, I'd like to parse out only the string, yielding "LEAD: Two police officers responding to a reported robbery..." but I'm unsure what the best approach would be. Can this be parsed easily with a regex? Nothing I've attempted so far seems to work.

Any advice would be appreciated!

Answer 1

You could also use the BeautifulSoup parser.

>>> from bs4 import BeautifulSoup
>>> s = '''<body.content><block class="lead_paragraph"><p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p></block><block class="full_text"><p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p></block>'''
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly to avoid a warning
>>> for block in soup.find_all('block', class_='full_text'):
...     print(block.text)
...
LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
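
If you would rather not install a third-party package, the standard library's xml.etree.ElementTree can probably do the same job, provided each article file is well-formed XML (the fragment quoted in the question is missing its closing </body.content> tag, so it would not parse on its own). The sketch below is only an illustration under that assumption: the sample string and the helper name full_text_paragraphs are made up for the example, and a real file would be loaded with ET.parse(path).getroot() instead of ET.fromstring.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample standing in for one article file.
SAMPLE = (
    '<body.content>'
    '<block class="lead_paragraph"><p>Lead only.</p></block>'
    '<block class="full_text"><p>First paragraph.</p><p>Second paragraph.</p></block>'
    '</body.content>'
)


def full_text_paragraphs(root):
    """Return the text of every <p> inside a <block class="full_text">."""
    paragraphs = []
    for block in root.iter("block"):
        # Skip blocks such as class="lead_paragraph".
        if block.get("class") != "full_text":
            continue
        for p in block.iter("p"):
            # itertext() also collects text nested in child tags of <p>.
            paragraphs.append("".join(p.itertext()).strip())
    return paragraphs


root = ET.fromstring(SAMPLE)
paragraphs = full_text_paragraphs(root)
print("\n".join(paragraphs))
```

Either parser is likely to be more robust than a regex here, since a pattern has to cope with attribute order, whitespace and nested tags on its own.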

Extract text inside XML tags in Python (while avoiding p tags)
