Extract text inside XML tags with in Python (while avoiding p tags)

2024/10/13 21:16:12

I'm working with the NYT corpus in Python and attempting to extract only what's located inside "full_text" class of every .xml article file. For example:

<body.content><block class="lead_paragraph"><p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p></block><block class="full_text"><p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p></block>

Ideally, I'd like to parse out only the string, yielding "LEAD: Two police officers responding to a reported robbery..." but I'm unsure of what the best approach would be. Is this something that can be easily parsed by regex? If so, nothing I've attempted seems to work.

Any advice would be appreciated!

Answer

You could use BeautifulSoup parser also.

>>> from bs4 import BeautifulSoup
>>> s = '''<body.content><block class="lead_paragraph"><p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p></block><block class="full_text"><p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p></block>'''
>>> soup = BeautifulSoup(s)
>>> for i in soup.findAll('block', class_="full_text"):print(i.text)LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.
https://en.xdnf.cn/q/118033.html

Related Q&A

Python (Flask) and MQTT listening

Im currently trying to get my Python (Flask) webserver to display what my MQTT script is doing. The MQTT script, In essence, its subscribed to a topic and I would really like to categorize the info it …

I dont show the image in Tkinter

The image doesnt show in Tkinter. The same code work in a new window, but in my class it does not. What could be the problem ?import Tkinterroot = Tkinter.Tkclass InterfaceApp(root):def __init__(self,…

Read temperature with MAX31855 Thermocouple Sensor on Windows IoT

I am working on a Raspberry Pi 2 with Windows IoT. I want to connect the Raspberry Pi with a MAX31855 Thermocouple Sensor which I bought on Adafruit. Theres a Python libary available on GitHub to read …

PIP: How to Cascade Requirements Files and Use Private Indexes? [duplicate]

This question already has an answer here:Installing Packages from Multiple Servers from One or More Requirements File(1 answer)Closed 9 years ago.I am trying to deploy a Django app to Heroku where one …

Python error: TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]

Below is a small sample of my dataframe. In [10]: dfOut[10]:TXN_KEY Send_Agent Pay_Agent Send_Amount Pay_Amount 0 13272184 AWD120279 AEU002152 85.99 85.04 1 13272947 ARA03012…

How to remove grey boundary lines in a map when plotting a netcdf using imshow in matplotlib?

Is it possible to remove the grey boundary lines around the in following map? I am trying to plotting a netcdf using matplotlib.from netCDF4 import Dataset # clarify use of Dataset import matplotlib.p…

function that takes one column value and returns another column value

Apologies, if this is a duplicate please let me know, Ill gladly delete.My dataset: Index Col 1 Col 20 1 4, 5, 6 1 2 7, 8, 9 2 3 10, 11, 12 3 …

Python write value from dictionary based on match in range of columns

From my df showing employees with multiple levels of managers (see prior question here), I want to map rows to a department ID, based on a manager ID that may appear across multiple columns:eid, mid…

Merge two dataframes based on a column

I want to compare name column in two dataframes df1 and df2 , output the matching rows from dataframe df1 and store the result in new dataframe df3. How do i do this in Pandas ? df1place name qty unit…

Python/Pandas - building a new column based in columns comparison

I have this dataframe:df:CNPJ Revenues 2016 Revenues 2015 Revenues 2014 0 01.637.895/0001-32 R$ 12.696.658 NaN R$ 10.848.213 1 02.916.265/0001-60 …