Python -- Regex -- How to find a string between two sets of strings

2024/9/25 0:39:41

Consider the following:

<div id=hotlinklist><a href="foo1.com">Foo1</a><div id=hotlink><a href="/">Home</a></div><div id=hotlink><a href="/extract">Extract</a></div><div id=hotlink><a href="/sitemap">Sitemap</a></div>
</div>

How would you go about taking out the sitemap line with regex in python?

<a href="/sitemap">Sitemap</a>

The following can be used to pull out the anchor tags.

'/<a(.*?)a>/i'

However, there are multiple anchor tags. Also there are multiple hotlink(s) so we can't really use them either?

Answer

Don't use a regex. Use BeautfulSoup, an HTML parser.

from BeautifulSoup import BeautifulSouphtml = \
"""
<div id=hotlinklist><a href="foo1.com">Foo1</a><div id=hotlink><a href="/">Home</a></div><div id=hotlink><a href="/extract">Extract</a></div><div id=hotlink><a href="/sitemap">Sitemap</a></div>
</div>"""soup = BeautifulSoup(html)
soup.findAll("div",id="hotlink")[2].a# <a href="/sitemap">Sitemap</a>
https://en.xdnf.cn/q/71640.html

Related Q&A

Kivy TextInput horizontal and vertical align (centering text)

How to center a text horizontally in a TextInput in Kivy?I have the following screen:But I want to centralize my text like this:And this is part of my kv language:BoxLayout: orientation: verticalLabe…

How to capture python SSL(HTTPS) connection through fiddler2

Im trying to capture python SSL(HTTPS) connections through Fiddler2 local proxy. But I only got an error.codeimport requests requests.get("https://www.python.org", proxies={"http": …

removing leading 0 from matplotlib tick label formatting

How can I change the ticklabels of numeric decimal data (say between 0 and 1) to be "0", ".1", ".2" rather than "0.0", "0.1", "0.2" in matplo…

How do I check if an iterator is actually an iterator container?

I have a dummy example of an iterator container below (the real one reads a file too large to fit in memory):class DummyIterator:def __init__(self, max_value):self.max_value = max_valuedef __iter__(sel…

Python Terminated Thread Cannot Restart

I have a thread that gets executed when some action occurs. Given the logic of the program, the thread cannot possibly be started while another instance of it is still running. Yet when I call it a sec…

TypeError: NoneType object is not subscriptable [duplicate]

This question already has an answer here:mysqldb .. NoneType object is not subscriptable(1 answer)Closed 8 years ago.The error: names = curfetchone()[0]TypeError: NoneType object is not subscriptable. …

Where can I find numpy.where() source code? [duplicate]

This question already has answers here:How do I use numpy.where()? What should I pass, and what does the result mean? [closed](2 answers)Closed 4 years ago.I have already found the source for the num…

NSUserNotificationCenter.defaultUserNotificationCenter() returns None in python

I am trying to connect to the Mountain Lion notification center via python. Ive installed pyobjc and am following the instructions here and here. Also see: Working with Mountain Lions Notification Cent…

Flask app hangs while processing the request

I have a simple flask app, single page, upload html and then do some processing on it on the POST; at POST request; i am using beautifulsoup, pandas and usually it takes 5-10 sec to complete the task. …

How do I make a server listen on multiple ports

I would like to listen on 100 different TCP port with the same server. Heres what Im currently doing:-import socket import selectdef main():server_socket = socket.socket(socket.AF_INET, socket.SOCK_STR…