I'm working with the NYT corpus in Python and trying to extract only the content of the block with class "full_text" from each article's .xml file. For example:
<body.content><block class="lead_paragraph"><p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p></block><block class="full_text"><p>LEAD: Two police officers responding to a reported robbery at a Brooklyn tavern early yesterday were themselves held up by the robbers, who took their revolvers and herded them into a back room with patrons, the police said.</p></block>
Ideally, I'd like to extract only the text, yielding "LEAD: Two police officers responding to a reported robbery..." but I'm not sure what the best approach is. Can this be parsed reliably with a regex? I've tried several patterns, but none of them has worked.
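One alternative to regex is an XML parser. Below is a minimal sketch of that approach using Python's standard xml.etree.ElementTree, assuming each article is well-formed XML. The names SAMPLE and extract_full_text are hypothetical, and the sample is a shortened, closed version of the snippet above rather than a real corpus file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; real corpus files contain far more markup.
SAMPLE = (
    '<body.content>'
    '<block class="lead_paragraph"><p>LEAD: Lead paragraph only.</p></block>'
    '<block class="full_text"><p>LEAD: First paragraph.</p>'
    '<p>Second paragraph.</p></block>'
    '</body.content>'
)

def extract_full_text(xml_string):
    """Return the text of every <p> inside <block class="full_text">."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    # iter() walks the whole tree, so the block may sit at any depth.
    for block in root.iter("block"):
        if block.get("class") == "full_text":
            for p in block.iter("p"):
                # itertext() also collects text nested in child tags of <p>.
                paragraphs.append("".join(p.itertext()).strip())
    return "\n".join(paragraphs)

print(extract_full_text(SAMPLE))
```

For a file on disk, ET.parse(path).getroot() could take the place of ET.fromstring(xml_string), with the rest of the function unchanged.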
Any advice would be appreciated!