Skip the error while scraping a list of URLs from a CSV

2024/11/20 19:35:59

I managed to scrape a list of URLs from a CSV file, but I have a problem: the scraping stops when it hits a broken link. It also prints a lot of None lines. Is it possible to get rid of them?

I would appreciate some help here. Thank you in advance!

Here is the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup  # required to parse html
import requests  # required to make request

# read file
with open('urls.csv', 'r') as f:
    csv_raw_cont = f.read()

# split by line
split_csv = csv_raw_cont.split('\n')

# specify separator
separator = ";"

# iterate over each line
for each in split_csv:
    # specify the row index
    url_row_index = 0  # in our csv example file the url is the first row so we set 0
    # get the url
    url = each.split(separator)[url_row_index]
    # fetch content from server
    html = requests.get(url).content
    # soup fetched content
    soup = BeautifulSoup(html, 'lxml')
    tags = soup.find("div", {"class": "productsPicture"}).findAll("a")
    for tag in tags:
        print(tag.get('href'))

The output, ending with the error, looks like this:

https://www.tennis-point.com/asics-gel-resolution-7-all-court-shoe-men-white-silver-02013802720000.html
None
https://www.tennis-point.com/cep-ultralight-run-sports-socks-men-black-light-green-12143000063000.html
None
https://www.tennis-point.com/asics-gel-solution-speed-3-clay-court-shoe-men-white-grey-02013802634000.html
None
https://www.tennis-point.com/asics-gel-solution-speed-3-all-court-shoe-men-white-silver-02013802723000.html
None
https://www.tennis-point.com/asics-gel-challenger-9-indoor-carpet-shoe-men-white-grey-02012401735000.html
None
https://www.tennis-point.com/asics-gel-court-speed-clay-court-shoe-men-dark-blue-yellow-02014202833000.html
None
https://www.tennis-point.com/asics-gel-court-speed-all-court-shoe-men-white-silver-02014202832000.html
None
Traceback (most recent call last):
  File "/Users/imaging-adrian/Desktop/Python Scripts/close_to_work.py", line 33, in <module>
    tags = soup.find("div", {"class": "productsPicture"}).findAll("a")
AttributeError: 'NoneType' object has no attribute 'findAll'
[Finished in 3.7s with exit code 1]
[shell_cmd: python -u "/Users/imaging-adrian/Desktop/Python Scripts/close_to_work.py"]
[dir: /Users/imaging-adrian/Desktop/Python Scripts]
[path: /Users/imaging-adrian/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/munki]

The links inside my CSV file look like this:

https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E701Y-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E601N-4907;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E601N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E600N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E326Y-0174;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E801N-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-9093;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E800N-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E804N-9095;
Answer

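Here is a working version. It skips any page that has no productsPicture div, and it writes only the href values that are not None to results.csv: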

from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile, delimiter=';')
    writer = csv.writer(results)
    for row in reader:
        # get the url
        url = row[0]
        # fetch content from server
        html = requests.get(url).content
        # soup fetched content
        soup = BeautifulSoup(html, 'html.parser')
        divTag = soup.find("div", {"class": "productsPicture"})
        if divTag:
            tags = divTag.findAll("a")
        else:
            continue
        for tag in tags:
            res = tag.get('href')
            if res is not None:
                writer.writerow([res])
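The check on the div covers the AttributeError from the question, but a link that is really broken (connection error, timeout, 404) would still raise inside requests.get, and a blank line in the CSV would fail on row[0]. Below is a minimal sketch of one way to guard against those cases too. The function names (iter_urls, extract_links, fetch, scrape) and the 10 second timeout are my own assumptions for illustration, not part of the original code.

```python
import csv

import requests
from bs4 import BeautifulSoup


def iter_urls(csv_lines, delimiter=";"):
    """Yield the first column of every non-empty row (hypothetical helper)."""
    for row in csv.reader(csv_lines, delimiter=delimiter):
        # csv.reader yields [] for a blank line; skip those and empty cells.
        if row and row[0].strip():
            yield row[0].strip()


def extract_links(html):
    """Return the hrefs inside div.productsPicture, or [] if the div is missing."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"class": "productsPicture"})
    if div is None:
        return []
    # href=True only matches <a> tags that have an href, so no None values.
    return [a["href"] for a in div.find_all("a", href=True)]


def fetch(url, timeout=10):
    """Download one page; raises requests.RequestException on failure."""
    response = requests.get(url, timeout=timeout)  # timeout value is an assumption
    response.raise_for_status()  # treat 4xx/5xx responses as errors
    return response.content


def scrape(urls, fetch_page=fetch):
    """Collect links from every URL and remember the URLs that failed."""
    links, failed = [], []
    for url in urls:
        try:
            html = fetch_page(url)
        except requests.RequestException:
            failed.append(url)
            continue
        links.extend(extract_links(html))
    return links, failed


# Possible usage with the file from the question:
# with open("urls.csv", newline="") as f:
#     links, failed = scrape(iter_urls(f))
```

Passing the download function in as fetch_page keeps the parsing testable without network access, and the failed list lets you retry or inspect the broken URLs afterwards instead of losing them silently.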
