I've used a script to run Selenium locally so that I can make use of the Selenium-rendered response within my spider.
This is the web service that runs Selenium locally:
from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    _driver = None

    @staticmethod
    def getDriver():
        # Create a single shared headless Chrome instance on first use
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            Selenium._driver = webdriver.Chrome(options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        # Fetch the requested url with Selenium and return the rendered page source
        url = str(request.args['url'])
        self.driver.get(url)
        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)
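Before wiring the service into the spider, it's worth a quick sanity check that it returns rendered HTML. A minimal sketch, assuming the service above is already running on http://127.0.0.1:5000 and chromedriver is on your PATH (the target URL is just an example):

import requests
from urllib.parse import quote

# Hypothetical smoke test against the local Selenium service
target = 'https://stackoverflow.com/questions/tagged/web-scraping'
resp = requests.get('http://127.0.0.1:5000/?url={}'.format(quote(target)))
print(resp.status_code)   # expect 200
print(resp.text[:300])    # start of the Selenium-rendered page source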
This is my Scrapy spider, which makes use of that response to parse the title from each page.
import scrapy
from urllib.parse import quote
from scrapy.crawler import CrawlerProcess

class StackSpider(scrapy.Spider):
    name = 'stackoverflow'
    url = 'https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pageSize=50'
    base = 'https://stackoverflow.com'

    def start_requests(self):
        # Route the listing page through the local Selenium service
        link = 'http://127.0.0.1:5000/?url={}'.format(quote(self.url))
        yield scrapy.Request(link, callback=self.parse)

    def parse(self, response):
        # Collect question links and route each of them through the service as well
        for item in response.css(".summary .question-hyperlink::attr(href)").getall():
            nlink = self.base + item
            link = 'http://127.0.0.1:5000/?url={}'.format(quote(nlink))
            yield scrapy.Request(link, callback=self.parse_info, dont_filter=True)

    def parse_info(self, response):
        item = response.css('h1[itemprop="name"] > a::text').get()
        yield {"title": item}

if __name__ == '__main__':
    c = CrawlerProcess()
    c.crawl(StackSpider)
    c.start()
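For reference, the same entry point can also persist the scraped titles instead of only logging them. A small sketch, assuming Scrapy 2.1+ where the FEEDS setting is available:

from scrapy.crawler import CrawlerProcess

# Hypothetical variant of the __main__ block above: FEEDS writes each
# yielded item to a JSON Lines file instead of only logging it.
c = CrawlerProcess(settings={
    'FEEDS': {'titles.jl': {'format': 'jsonlines'}},
})
c.crawl(StackSpider)
c.start()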
The problem is that the above script gives me the same title multiple times, then another title, and so on.
What change should I make so that my script works the right way?
I ran both scripts, and they run as intended. Here are my findings:
downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError
There is no way to get past this error without the permission of the server, i.e. eBay here.
Logs from Scrapy:
2019-05-25 07:28:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 72,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 64,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8,
'downloader/request_bytes': 55523,
'downloader/request_count': 81,
'downloader/request_method_count/GET': 81,
'downloader/response_bytes': 2448476,
'downloader/response_count': 9,
'downloader/response_status_count/200': 9,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2019, 5, 25, 1, 58, 41, 234183),
'item_scraped_count': 8,
'log_count/DEBUG': 90,
'log_count/INFO': 9,
'request_depth_max': 1,
'response_received_count': 9,
'retry/count': 72,
'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 64,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 8,
'scheduler/dequeued': 81,
'scheduler/dequeued/memory': 81,
'scheduler/enqueued': 131,
'scheduler/enqueued/memory': 131,
'start_time': datetime.datetime(2019, 5, 25, 1, 56, 57, 751009)}
2019-05-25 07:28:41 [scrapy.core.engine] INFO: Spider closed (shutdown)
You can see that only 8 items were scraped; these are just logos and other unrestricted resources.
Server log:
s://*.ebaystatic.com http://*.ebay.com https://*.ebay.com". Either the 'unsafe-inline' keyword, a hash ('sha256-40GZDfucnPVwbvI/Q1ivGUuJtX8krq8jy3tWNrA/n58='), or a nonce ('nonce-...') is required to enable inline execution.
", source: https://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=323815597324&t=0&tid=10&category=169291&seller=wardrobe-ltd&excSoj=1&excTrk=1&lsite=0&ittenable=false&domain=ebay.com&descgauge=1&cspheader=1&oneClk=1&secureDesc=1 (1)
eBay does not allow you to scrape it.
So, how do you complete your task?
Every time, before scraping, check /robots.txt for the site in question (a programmatic check is sketched after the listing below).
For eBay it is: http://www.ebay.com/robots.txt
And you can see almost everything is disallowed.
User-agent: *
Disallow: /*rt=nc
Disallow: /b/LH_
Disallow: /brw/
Disallow: /clp/
Disallow: /clt/store/
Disallow: /csc/
Disallow: /ctg/
Disallow: /ctm/
Disallow: /dsc/
Disallow: /edc/
Disallow: /feed/
Disallow: /gsr/
Disallow: /gwc/
Disallow: /hcp/
Disallow: /itc/
Disallow: /lit/
Disallow: /lst/ng/
Disallow: /lvx/
Disallow: /mbf/
Disallow: /mla/
Disallow: /mlt/
Disallow: /myb/
Disallow: /mys/
Disallow: /prp/
Disallow: /rcm/
Disallow: /sch/%7C
Disallow: /sch/LH_
Disallow: /sch/aop/
Disallow: /sch/ctg/
Disallow: /sl/node
Disallow: /sme/
Disallow: /soc/
Disallow: /talk/
Disallow: /tickets/
Disallow: /today/
Disallow: /trylater/
Disallow: /urw/write-review/
Disallow: /vsp/
Disallow: /ws/
Disallow: /sch/modules=SEARCH_REFINEMENTS_MODEL_V2
Disallow: /b/modules=SEARCH_REFINEMENTS_MODEL_V2
Disallow: /itm/_nkw
Disallow: /itm/?fits
Disallow: /itm/&fits
Disallow: /cta/
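If you want to automate that check, the standard library can parse robots.txt for you. A minimal sketch using urllib.robotparser (the two test URLs are illustrative):

from urllib.robotparser import RobotFileParser

# Parse eBay's robots.txt and test URLs against its Disallow rules
rp = RobotFileParser()
rp.set_url('http://www.ebay.com/robots.txt')
rp.read()

print(rp.can_fetch('*', 'https://www.ebay.com/sch/LH_Auction'))  # False: matches Disallow: /sch/LH_
print(rp.can_fetch('*', 'https://www.ebay.com/itm/123456789'))   # True: plain /itm/ paths are not listed

Scrapy can also enforce this for you: set ROBOTSTXT_OBEY = True in your project settings, and disallowed requests are filtered out before they are sent.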
Therefore, go to https://developer.ebay.com/api-docs/developer/static/developer-landing.html and check their docs; there is simpler example code on their site for getting the items you need without scraping.
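As a rough illustration of what that looks like, here is a hedged sketch of an item search with eBay's Browse API. The endpoint comes from eBay's developer docs; EBAY_OAUTH_TOKEN is a placeholder for a token from your own developer account, and the query parameters are just examples:

import requests

# Placeholder token: obtain a real one from your eBay developer account
EBAY_OAUTH_TOKEN = '<your application token>'

resp = requests.get(
    'https://api.ebay.com/buy/browse/v1/item_summary/search',
    params={'q': 'wardrobe', 'limit': 3},
    headers={'Authorization': 'Bearer {}'.format(EBAY_OAUTH_TOKEN)},
)
for item in resp.json().get('itemSummaries', []):
    print(item.get('title'))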