I don't understand regex. I also don't know how to install Beautiful Soup 4 or lxml on Windows; I get errors when I try to install these libraries.
I've tried:
C:\Users\admin\Desktop>python
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r = re.compile('(?<=href=").*?(?=")')
>>> r.findall(url)
['/example/hello/get/9f676bac2bb3.zip']
>>> url
'<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r.findall(url)[0]
'/example/hello/get/9f676bac2bb3.zip'
>>> a = "https://xyz.com"
>>> print(a + r.findall(url)[0])
https://xyz.com/example/hello/get/9f676bac2bb3.zip
>>>
But that's just a hardcoded HTML sample. How do I get the web page source and run my code against it?
Answer
You can use the built-in xml.etree.ElementTree instead:
>>> import xml.etree.ElementTree as ET
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'
This works for this particular example, but xml.etree.ElementTree is an XML parser, not an HTML parser, so it will fail on real-world HTML that isn't well-formed XML. Consider using BeautifulSoup:
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url, 'html.parser').a.get('href')
'/example/hello/get/9f676bac2bb3.zip'
Personally, I prefer BeautifulSoup: it makes HTML parsing easy, transparent and fun.
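If installing third-party packages is the obstacle, as the question suggests, the standard library's html.parser module may be enough for pulling href values out of a page. The following is only a minimal sketch; the names HrefCollector and extract_hrefs are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all <a href="..."> values found in the given HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
print(extract_hrefs(sample))
```

Unlike the regex in the question, this does not depend on the exact quoting or attribute order in the tag, though it is still more manual than BeautifulSoup.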
To follow the link and download the file, build a full URL that includes the scheme and domain (urljoin() from urllib.parse helps here), then pass it to urlretrieve() from urllib.request.
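A rough sketch of how fetching the page, building the absolute URL and downloading could fit together in Python 3 is shown here. The helper names (fetch_html, absolute_url, filename_from_url, download) are hypothetical, https://xyz.com is just the placeholder domain from the question, and the UTF-8 fallback for pages that declare no charset is an assumption.

```python
import posixpath
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen, urlretrieve


def fetch_html(page_url):
    """Download a page and return its source as text."""
    with urlopen(page_url) as response:
        # Assumption: fall back to UTF-8 when the server declares no charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def absolute_url(page_url, href):
    """Combine the page URL with a (possibly relative) href."""
    return urljoin(page_url, href)


def filename_from_url(url):
    """Take the last path segment of the URL as the local file name."""
    return posixpath.basename(urlparse(url).path)


def download(page_url, href):
    """Resolve href against page_url and save the target in the current directory."""
    file_url = absolute_url(page_url, href)
    local_name = filename_from_url(file_url) or "download.bin"
    urlretrieve(file_url, local_name)
    return local_name


# No network access is needed for the URL handling itself:
full = absolute_url("https://xyz.com", "/example/hello/get/9f676bac2bb3.zip")
print(full)
print(filename_from_url(full))
```

With a real page URL, the text returned by fetch_html(page_url) can be passed to whichever parser extracts the href, and that href can then be handed to download(page_url, href).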