HTML Link parsing using BeautifulSoup

2024/10/14 23:18:14

Here is the Python code I'm using to extract specific HTML from the page links I pass as a parameter. I'm using BeautifulSoup. The code sometimes works fine and sometimes gets stuck!

import urllib
from bs4 import BeautifulSoup

rawHtml = ''
url = r''
for i in range(1, 49):
    # iterate over the URLs and capture the content
    sock = urllib.urlopen(url + str(i))
    html = sock.read()  # read the page body before closing the socket
    sock.close()
    rawHtml += html
    print i

Here I'm printing the loop variable to find out where it gets stuck. It shows that it gets stuck at a random iteration of the loop.
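One way to keep the loop from hanging silently is to give every request a timeout and record which pages fail. The following is only a sketch under assumptions: `fetch_pages`, its parameters, and the default timeout and retry counts are hypothetical, not part of the original script. It uses `urllib2.urlopen` (or `urllib.request.urlopen` on Python 3) because, unlike `urllib.urlopen` in Python 2.7, it accepts a `timeout` argument.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def fetch_pages(base_url, page_numbers, timeout=10, retries=2, opener=urlopen):
    """Fetch base_url + str(n) for each n and return (pages, failed).

    Hypothetical helper: a request that times out or raises a network
    error is retried, then recorded in `failed` instead of stalling
    the whole loop.
    """
    pages = {}   # page number -> response body
    failed = {}  # page number -> last exception seen
    for n in page_numbers:
        for attempt in range(retries + 1):
            try:
                sock = opener(base_url + str(n), timeout=timeout)
                try:
                    pages[n] = sock.read()
                finally:
                    sock.close()
                failed.pop(n, None)  # an earlier attempt may have failed
                break
            except IOError as exc:
                # URLError and socket.timeout are both IOError subclasses
                # (IOError is an alias of OSError on Python 3).
                failed[n] = exc
    return pages, failed
```

With the script's own `url`, a call such as `pages, failed = fetch_pages(url, range(1, 49))` would return after at most a bounded wait per page, and `failed` would show which page numbers raised errors and what those errors were.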

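t = ''  # accumulates the extracted links
soup = BeautifulSoup(rawHtml, 'html.parser')
for link in soup.find_all('a'):
    t += str(link.get('href')) + "</br>"
    #t += str(link) + "</br>"
f = open("Link.txt", 'w+')
f.write(t)  # write the collected links to the file
f.close()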

What could be the possible issue? Is it a problem with the socket configuration, or something else?

This is the error I got. I checked these links for a solution: python-gaierror-errno-11004 and ioerror-errno-socket-error-errno-11004-getaddrinfo-failed. But I didn't find them very helpful.

Traceback (most recent call last):
  File "", line 8, in <module>
    sock = urllib.urlopen(url+ str(i))
  File "d:\python\lib\", line 87, in urlopen
    return opener.open(url)
  File "d:\python\lib\", line 213, in open
    return getattr(self, name)(url)
  File "d:\python\lib\", line 350, in open_http
    h.endheaders(data)
  File "d:\python\lib\", line 1049, in endheaders
    self._send_output(message_body)
  File "d:\python\lib\", line 893, in _send_output
    self.send(msg)
  File "d:\python\lib\", line 855, in send
    self.connect()
  File "d:\python\lib\", line 832, in connect
    self.timeout, self.source_address)
  File "d:\python\lib\", line 557, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed

It runs perfectly fine on my personal laptop, but it gives this error on my office desktop. Also, my Python version is 2.7. I hope this information helps.
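Since `getaddrinfo failed` means the hostname lookup itself did not succeed, one quick check is to try resolving the host directly on each machine, without making any HTTP request. This is a rough diagnostic sketch, not a fix; `can_resolve` and its `resolver` parameter are hypothetical names introduced here for illustration.

```python
import socket

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def can_resolve(url, resolver=socket.getaddrinfo):
    """Return (True, addresses) if the URL's host resolves, else (False, error)."""
    parsed = urlparse(url)
    host = parsed.hostname
    if not host:
        return False, ValueError("no host name found in URL: %r" % (url,))
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        # Same style of lookup the traceback shows inside create_connection.
        results = resolver(host, port, 0, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, exc
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved IP address.
    return True, sorted(set(r[4][0] for r in results))
```

If `can_resolve(url)` succeeds on the laptop but returns a `socket.gaierror` on the office desktop, that would suggest the difference lies in the network environment (DNS, firewall, or proxy) rather than in the script.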


Finally, guys, it worked! The same script worked when I tried it on other PCs too. So the problem was probably the firewall or proxy settings on my office desktop, which were blocking this website.
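If a proxy really is the cause, the script may need to be told about it explicitly. The sketch below shows one possible approach using `ProxyHandler` and `build_opener`; `make_opener` is a hypothetical helper, and any proxy address passed to it is an assumption that would have to come from the office network's actual configuration.

```python
try:
    from urllib.request import ProxyHandler, build_opener, getproxies  # Python 3
except ImportError:
    from urllib2 import ProxyHandler, build_opener  # Python 2.7
    from urllib import getproxies


def make_opener(proxy_url=None):
    """Return (opener, proxies) using proxy_url, or the detected system proxies.

    Hypothetical helper: when proxy_url is None, the proxies come from
    getproxies(), which reads the environment / system settings.
    """
    if proxy_url is None:
        proxies = getproxies()
    else:
        proxies = {"http": proxy_url, "https": proxy_url}
    return build_opener(ProxyHandler(proxies)), proxies
```

Printing the returned `proxies` on both machines shows whether Python detects any proxy at all, and `opener.open(url + str(i), timeout=10)` could then stand in for the plain `urlopen` call in the loop. A firewall that blocks the site outright cannot be worked around this way.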

