Beautiful soup: Extract everything between two tags when these tags have different ids

2024/11/18 21:50:20

Beautiful soup: Extract everything between two tags

I have seen the question linked above, which shows how to get the content between two tags. In my case, I need the content between two tags that have different id attribute values:

<h1 id = 'beautiful' ></h1>Text <i>here</i> has no tag<div>This is in a div</div><h1 id = 'good' ></h1>


For context, the linked question reads:

I am using BeautifulSoup to extract data from HTML files. I want to get all of the information between the two tags. This means that if I have an HTML section like this:

<h1></h1>Text <i>here</i> has no tag<div>This is in a div</div><h1></h1>

Then if I wanted all of the information between the first h1 and the second h1, the output would look like this:

Text <i>here</i> has no tag<div>This is in a div</div>

from bs4 import BeautifulSoup

html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''

soup = BeautifulSoup(html_doc, 'html.parser')

for c in list(soup.contents):
    if c is soup.h1 or c.find_previous('h1') is soup.h1:
        continue
    c.extract()

for h1 in soup.select('h1'):
    h1.extract()

print(soup)

Prints:

Text <i>here</i> has no tag
<div>This is in a div</div>

This works when the h1 tags have no ids, but I need it to work when the two tags are identified by different ids.

Could someone help me with this?

Answer

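The parent attribute and the decompose/extract methods might be helpful for you.

from bs4 import BeautifulSoup

html_doc = "<h1 id='beautiful'></h1>Text <i>here</i> has no tag<div>This is in a div</div><h1 id='good'></h1>"

# 1. Find the first item you are looking for.
soup = BeautifulSoup(html_doc, 'html.parser')
hElem = soup.find('h1', {'id': 'beautiful'})
# 2. Find the second condition.
endElem = soup.find('h1', {'id': 'good'})
# 3. Get parent element that contains both.
hParent = hElem.parent  # Can be made more complex if multiple ancestors are needed to contain both conditions.
# 4. Iterate through children and remove all children outside the conditions.
#    list() takes a copy, because the tree is modified while looping.
#    extract() is used instead of decompose() because the children include
#    plain text nodes, which may not support decompose() in every bs4 version.
inBetween = False
for child in list(hParent.children):
    if child is hElem:
        inBetween = True
        child.extract()  # drop the start marker itself
    elif child is endElem:
        inBetween = False
        child.extract()  # drop the end marker itself
    elif not inBetween:
        child.extract()  # drop everything outside the two markers
# Remaining data.
print(hParent)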
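
If modifying the tree is not desirable, a read-only variant may be worth considering. The following is only a minimal sketch, assuming both h1 tags share the same parent; the helper name between_ids and the sample html_doc are made up for illustration and are not part of the original answer. It walks the siblings after the start tag and stops at the end tag:

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, modelled on the HTML in the question.
html_doc = (
    "<p>Not wanted</p>"
    "<h1 id='beautiful'></h1>"
    "Text <i>here</i> has no tag<div>This is in a div</div>"
    "<h1 id='good'></h1>"
    "<p>Not wanted either</p>"
)


def between_ids(soup, start_id, end_id):
    """Return the markup between two sibling tags identified by their ids."""
    start = soup.find(id=start_id)
    end = soup.find(id=end_id)
    if start is None or end is None:
        raise ValueError("start or end tag not found")
    parts = []
    # next_siblings yields both tags and plain text nodes after `start`.
    for node in start.next_siblings:
        if node is end:
            break
        parts.append(str(node))
    else:
        # The loop finished without meeting `end`.
        raise ValueError("end tag is not a later sibling of the start tag")
    return "".join(parts)


soup = BeautifulSoup(html_doc, "html.parser")
result = between_ids(soup, "beautiful", "good")
print(result)
```

Because nothing is removed, the soup object can still be used afterwards. If the two tags sit at different nesting levels, this sketch raises a ValueError rather than guessing at a common ancestor.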
https://en.xdnf.cn/q/118621.html
