BeautifulSoup: find an element by text using `find_all`, whether or not it contains child elements

2024/10/6 14:23:31

For example

import re
from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><a>sometext</a></html>", "html.parser")
print(bs.find_all("a", text=re.compile(r"some")))

returns [<a>sometext</a>], but when the element being searched for has a child element, e.g. an img:

bs = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
print(bs.find_all("a", text=re.compile(r"some")))

it returns []

Is there a way to use find_all to match the latter example?

Answer

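You will need a hybrid approach, because text= fails when an element has child elements as well as text: collect the candidate tags with find_all, then filter them yourself on their full text (e.text). Use reg.search rather than reg.match so that the pattern can match anywhere in the text, which is how find_all applies a regular expression.

import re
from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><a>sometext<img /></a></html>", "html.parser")
reg = re.compile(r'some')
elements = [e for e in bs.find_all('a') if reg.search(e.text)]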
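
If you would rather keep everything inside a single find_all call, one option is to pass a function as the filter instead of text=. The sketch below assumes bs4 with the built-in "html.parser"; the helper name a_with_matching_text and the extra <a>other</a> element are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = "<html><a>sometext<img /></a><a>other</a></html>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"some")

def a_with_matching_text(tag):
    # find_all calls this once for every tag; a truthy result keeps the tag.
    # get_text() joins all strings inside the tag, so child tags do not matter.
    return tag.name == "a" and pattern.search(tag.get_text()) is not None

matches = soup.find_all(a_with_matching_text)
print(matches)
```

This should keep only the first <a>, even though it contains an <img> child. Note that get_text() also includes text from nested tags, which may or may not be what you want.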

Background

When BeautifulSoup is searching for an element and a text filter is set (here, a compiled regular expression), it eventually calls:

self._matches(found.string, self.text)

In the two examples you gave, the .string property returns different things:

>>> bs1 = BeautifulSoup("<html><a>sometext</a></html>")
>>> bs1.find('a').string
u'sometext'
>>> bs2 = BeautifulSoup("<html><a>sometext<img /></a></html>")
>>> bs2.find('a').string
>>> print bs2.find('a').string
None
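
Putting the two observations together, the text filter behaves roughly like the simplified model below. This is only a sketch, not the library's actual code: text_filter_accepts is a hypothetical helper, and it assumes the filter is a compiled regular expression applied with search().

```python
import re
from bs4 import BeautifulSoup

def text_filter_accepts(tag, pattern):
    # Hypothetical stand-in for the check BeautifulSoup applies to tag.string.
    s = tag.string
    # When .string is None there is nothing for the pattern to match against.
    return s is not None and pattern.search(s) is not None

pattern = re.compile(r"some")
plain = BeautifulSoup("<a>sometext</a>", "html.parser").find("a")
with_img = BeautifulSoup("<a>sometext<img /></a>", "html.parser").find("a")

print(text_filter_accepts(plain, pattern))     # True
print(text_filter_accepts(with_img, pattern))  # False
```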

The .string property looks like this:

@property
def string(self):
    """Convenience property to get the single string within this tag.

    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string

If we print out the contents we can see why this returns None:

>>> print bs1.find('a').contents
[u'sometext']
>>> print bs2.find('a').contents
[u'sometext', <img/>]
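
To see all three cases from the docstring side by side, here is a small comparison of len(.contents), .string and get_text(). It is a sketch assuming bs4 with "html.parser"; the third sample (a single nested <b>) is added for illustration and is not from the question.

```python
from bs4 import BeautifulSoup

samples = {
    "text only": "<a>sometext</a>",
    "text plus tag": "<a>sometext<img /></a>",
    "single nested tag": "<a><b>sometext</b></a>",
}

results = {}
for label, html in samples.items():
    a = BeautifulSoup(html, "html.parser").find("a")
    # .string is only set when there is exactly one child to follow;
    # get_text() always concatenates every string inside the tag.
    results[label] = (len(a.contents), a.string, a.get_text())
    print(label, results[label])
```

Only the "text plus tag" case has two children, so it is the only one where .string is None, while get_text() gives 'sometext' in every case.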