Nested tags in BeautifulSoup - Python

2024/5/20 15:52:11

I've looked at many examples on websites and on Stack Overflow, but I couldn't find a general solution to my question. I'm dealing with a really messy website and I'd like to scrape some data from it. The markup looks like this:

...
<body>
...<table><tbody><tr>...</tr><tr><td>...</td><td><table><tr>...</tr><tr><td><a href="...">Some link</a><a href="...">Some link</a><a href="...">Some link</a></td></tr></table></td></tr></tbody></table>
</body>

The issue I'm having is that none of the elements have attributes I can select on to narrow the scope. Each "..." may contain similar markup, such as more <a> and <table> elements.

I know that the path table tr table tr td a is unique to the links I need, but how would BeautifulSoup grab those? I'm not sure how to grab nested tags without writing a separate line of code for each level.

Any help?

Answer

You can use CSS selectors in select:

soup.select('table tr table tr td a')

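For example, in an IPython session against a live page (Python 3, with bs4 and urllib.request imported; the output is as captured when this answer was written):

In [32]: bs4.BeautifulSoup(urllib.request.urlopen('http://google.com/?hl=en').read(), 'html.parser').select('#footer a')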
Out[32]:
[<a href="/intl/en/ads/">Advertising Programs</a>,
 <a href="/services/">Business Solutions</a>,
 <a href="https://plus.google.com/116899029375914044550" rel="publisher">+Google</a>,
 <a href="/intl/en/about.html">About Google</a>,
 <a href="http://www.google.com/setprefdomain?prefdom=RU&amp;prev=http://www.google.ru/&amp;sig=0_3F2sRGWVktTCOFLA955Vr-AWlHo%3D">Google.ru</a>,
 <a href="/intl/en/policies/">Privacy &amp; Terms</a>]
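
To see the selector working on markup shaped like the one in the question, here is a minimal sketch. The HTML string, the hrefs and the link text are made up for illustration, and it assumes the bs4 package is installed and uses the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure in the question; the hrefs and
# link text are invented for this example.
html = """
<body>
<a href="/outside">Outside link</a>
<table><tbody>
<tr><td><a href="/top-level">Top-level link</a></td></tr>
<tr><td>Cell</td><td>
<table>
<tr><td>Header</td></tr>
<tr><td><a href="/one">One</a><a href="/two">Two</a><a href="/three">Three</a></td></tr>
</table>
</td></tr>
</tbody></table>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Spaces are the CSS descendant combinator: this matches every <a> that sits
# somewhere under table -> tr -> table -> tr -> td, with any tags in between.
links = soup.select("table tr table tr td a")

# Pull out the attribute and the text of each matched tag.
hrefs = [a["href"] for a in links]
texts = [a.get_text() for a in links]
print(hrefs)
print(texts)
```

Only the three links in the inner table are returned; the link outside the tables and the one in the outer table's first row are skipped because they have no second table/tr pair above them. If the real page has other doubly nested tables, the selector would match their links too, so it may need tightening (for example with the > child combinator).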
