Wrap multiple tags with BeautifulSoup

2024/11/14 13:15:23

I'm writing a Python script that converts an HTML document into a reveal.js slideshow. To do this, I need to wrap multiple tags inside a <section> tag.

It's easy to wrap a single tag inside another one using the wrap() method. However, I can't figure out how to wrap several tags at once.

To clarify, here is the original HTML:

html_doc = """
<html><head><title>The Dormouse's story</title>
</head><body><h1 id="first-paragraph">First paragraph</h1><p>Some text...</p><p>Another text...</p><div><a href="http://link.com">Here's a link</a></div><h1 id="second-paragraph">Second paragraph</h1><p>Some text...</p><p>Another text...</p><script src="lib/.js"></script>
</body></html>
"""

I'd like to wrap each <h1> and the tags that follow it inside a <section> tag, like this:

<html>
<head><title>The Dormouse's story</title>
</head>
<body><section><h1 id="first-paragraph">First paragraph</h1><p>Some text...</p><p>Another text...</p><div><a href="http://link.com">Here's a link</a></div></section><section><h1 id="second-paragraph">Second paragraph</h1><p>Some text...</p><p>Another text...</p></section><script src="lib/.js"></script>
</body></html>

Here's how I made the selection:

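from bs4 import BeautifulSoup
import itertools

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html_doc, 'html.parser')
h1s = soup.find_all('h1')
for el in h1s:
    # Collect everything after this <h1> until the next <h1> or a <script>
    els = [i for i in itertools.takewhile(
        lambda x: x.name not in [el.name, 'script'],
        el.next_elements)]
    els.insert(0, el)
    print(els)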

Output:

[<h1 id="first-paragraph">First paragraph</h1>, 'First paragraph', '\n  ', <p>Some text...</p>, 'Some text...', '\n  ', <p>Another text...</p>, 'Another text...', '\n  ', <div><a href="http://link.com">Here's a link</a>  </div>, '\n    ', <a href="http://link.com">Here's a link</a>, "Here's a link", '\n  ', '\n\n  ']
[<h1 id="second-paragraph">Second paragraph</h1>, 'Second paragraph', '\n  ', <p>Some text...</p>, 'Some text...', '\n  ', <p>Another text...</p>, 'Another text...', '\n\n  ']

The selection is correct but I can't see how to wrap each selection inside a <section> tag.
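A side note on the selection: next_elements walks the tree in document order and descends into children, which is why the output lists the <div>, then its <a>, then the link text as separate items. The sketch below contrasts it with next_siblings, which stays on the same level as the starting tag. The HTML snippet is a shortened stand-in for the document above, not the original input.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the document in the question
html = "<body><h1>Title</h1><div><a href='#'>link</a></div><p>Text</p></body>"
soup = BeautifulSoup(html, "html.parser")
h1 = soup.h1

# next_elements: document order, including the h1's own text and every descendant.
# Text nodes have no tag name, so fall back to their string value.
elements = [getattr(node, "name", None) or str(node) for node in h1.next_elements]

# next_siblings: only the nodes that share the h1's parent
siblings = [node.name for node in h1.next_siblings]

print(elements)
print(siblings)
```

When the goal is to move whole blocks into a wrapper, the sibling view is usually the more convenient one, because moving a tag carries its descendants along with it.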

Answer

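I finally worked out how to use the wrap() method in this case. The key was understanding that every change to the soup object is made in place. I also switched from next_elements to next_siblings, so only nodes on the same level as the <h1> are moved.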
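To illustrate what "in place" means here, a minimal sketch on a two-tag document (the snippet is made up for the illustration): wrap() rewires the tree immediately and returns the wrapper, and append() moves an existing node rather than copying it.

```python
from bs4 import BeautifulSoup

# Tiny illustrative document, not the one from the question
soup = BeautifulSoup("<body><h1>Title</h1><p>Text</p></body>", "html.parser")
h1, p = soup.h1, soup.p

section = soup.new_tag("section")
returned = h1.wrap(section)  # the tree changes right here; the wrapper is returned
section.append(p)            # p already lives in the tree, so it is moved, not copied

result = str(soup)
print(result)
```

Because nothing is copied, there is no need to rebuild the document or reassign anything afterwards; the same soup object already holds the new structure.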

from bs4 import BeautifulSoup
import itertools

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html_doc, 'html.parser')

# wrap all h1 and next siblings into sections
h1s = soup.find_all('h1')
for el in h1s:
    # Build the list before wrapping: once el is wrapped it has no siblings left
    els = [i for i in itertools.takewhile(
        lambda x: x.name not in [el.name, 'script'],
        el.next_siblings)]
    section = soup.new_tag('section')
    el.wrap(section)
    for tag in els:
        section.append(tag)

print(soup.prettify())

This gives me the desired output. Hope that helps.
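For reuse, the same idea could be packaged as a function. This is only a sketch under a few assumptions: the name wrap_in_sections and its parameters heading and stop_tags are invented for the example, and the sample HTML (including app.js) is a shortened stand-in for the document in the question.

```python
import itertools
from bs4 import BeautifulSoup


def wrap_in_sections(soup, heading="h1", stop_tags=("script",)):
    """Wrap each `heading` tag and its following siblings in a <section>.

    Hypothetical helper: collection stops at the next heading or at any
    tag named in `stop_tags`.
    """
    boundaries = (heading, *stop_tags)
    for el in soup.find_all(heading):
        # Materialise the siblings first; after wrap() the heading sits alone
        # inside the new section and has no following siblings any more.
        following = list(itertools.takewhile(
            lambda node: getattr(node, "name", None) not in boundaries,
            el.next_siblings,
        ))
        section = el.wrap(soup.new_tag("section"))
        for node in following:
            section.append(node)  # moves the node into the section
    return soup


# Shortened, made-up input for the demonstration
html = (
    '<body><h1 id="a">First</h1><p>One</p><div><a href="#">x</a></div>'
    '<h1 id="b">Second</h1><p>Two</p><script src="app.js"></script></body>'
)
soup = wrap_in_sections(BeautifulSoup(html, "html.parser"))
print(soup.prettify())
```

Passing a different heading (for example "h2") or extra stop tags should adapt the sketch to other slide layouts, though nested headings are not handled.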

https://en.xdnf.cn/q/72327.html
