How to extract a URL from an HTML anchor element using Python 3? [closed]

2024/11/17 18:55:02

I want to extract a URL from a web page's HTML source.
Example:

xyz.com source code:
<a rel="nofollow" href="example/hello/get/9f676bac2bb3.zip">Download XYZ</a>

I want to extract:

example/hello/get/9f676bac2bb3.zip

How can I extract this URL?

I don't understand regex. I also don't know how to install Beautiful Soup 4 or lxml on Windows; I get errors when I try to install these libraries.

I've tried:

C:\Users\admin\Desktop>python
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r = re.compile('(?<=href=").*?(?=")')
>>> r.findall(url)
['/example/hello/get/9f676bac2bb3.zip']
>>> url
'<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r.findall(url)[0]
'/example/hello/get/9f676bac2bb3.zip'
>>> a = "https://xyz.com"
>>> print(a + r.findall(url)[0])
https://xyz.com/example/hello/get/9f676bac2bb3.zip
>>>

But this is just a hardcoded HTML sample. How do I get the web page source and run my code against it?

Answer

You can use the built-in xml.etree.ElementTree instead:

>>> import xml.etree.ElementTree as ET
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

This works on this particular example, but xml.etree.ElementTree is not an HTML parser. Consider using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url, 'html.parser').a.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Or, lxml.html:

>>> import lxml.html
>>> lxml.html.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Personally, I prefer BeautifulSoup - it makes HTML parsing easy, transparent and fun.
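If installing third-party packages is the obstacle, the standard library's html.parser module can do the same job without any installation. The following is only a minimal sketch: the names HrefCollector, extract_hrefs and sample are made up for illustration, and the sample string adds an unclosed <br> tag (not in the original question) to show that this parser tolerates markup that is not well-formed XML.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all href values found in anchor tags, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical sample: the anchor from the question wrapped in sloppy HTML.
sample = '<p>Intro<br><a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a></p>'
print(extract_hrefs(sample))
# ['/example/hello/get/9f676bac2bb3.zip']
```

This collects every anchor on the page, so on a real page you would probably still need to filter the list (for example, by file extension) to pick out the link you want.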


To follow the link and download the file, you need to build a full URL that includes the scheme and domain (urljoin() helps with that) and then pass it to urlretrieve(). Example:

>>> BASE_URL = 'http://example.com'
>>> from urllib.parse import urljoin
>>> from urllib.request import urlretrieve
>>> href = BeautifulSoup(url, 'html.parser').a.get('href')
>>> urlretrieve(urljoin(BASE_URL, href))
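As for getting the page source in the first place, urllib.request.urlopen() from the standard library can fetch it. Here is a rough sketch under a few assumptions: PAGE_URL is a made-up address, fetch_html and download are hypothetical helper names, and the charset falls back to UTF-8 when the server does not declare one. It also shows how urljoin() treats the two href forms that appear in the question (with and without a leading slash).

```python
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

# Hypothetical address of the page that contains the link.
PAGE_URL = "https://xyz.com/downloads/index.html"


def fetch_html(page_url):
    """Download a page and return its source as text."""
    with urlopen(page_url) as response:
        # Assumption: fall back to UTF-8 if no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def download(page_url, href, filename):
    """Resolve href against page_url and save the target to filename."""
    full_url = urljoin(page_url, href)
    urlretrieve(full_url, filename)
    return full_url


# A leading slash makes the href relative to the site root:
print(urljoin(PAGE_URL, "/example/hello/get/9f676bac2bb3.zip"))
# https://xyz.com/example/hello/get/9f676bac2bb3.zip

# Without it, the href is resolved relative to the page's directory:
print(urljoin(PAGE_URL, "example/hello/get/9f676bac2bb3.zip"))
# https://xyz.com/downloads/example/hello/get/9f676bac2bb3.zip
```

Passing the address of the page itself to urljoin(), rather than only the domain, matters for hrefs without a leading slash. Giving urlretrieve() an explicit filename saves the file where you expect it instead of in a temporary location.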

UPD (for the different HTML posted in the comments):

>>> from bs4 import BeautifulSoup
>>> data = '<html> <head> <body><example><example2> <a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a> </example2></example></body></head></html>'
>>> href = BeautifulSoup(data, 'html.parser').find('a', string='XYZ').get('href')
>>> href
'/example/hello/get/9f676bac2bb3.zip'