I've used a script to run Selenium locally so that I can make use of the Selenium-rendered response within my spider.
This is the web service that runs Selenium locally:
from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    _driver = None

    @staticmethod
    def getDriver():
        # Create a single shared headless Chrome instance on first use
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            Selenium._driver = webdriver.Chrome(options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        # Fetch the requested url with Selenium and return the rendered page source
        url = str(request.args['url'])
        self.driver.get(url)
        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)
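Before wiring the service into the spider, it's worth a quick sanity check that it returns rendered HTML. A minimal sketch, assuming the service above is already running on http://127.0.0.1:5000 and chromedriver is on your PATH (the target URL is just an example):

import requests
from urllib.parse import quote

# Hypothetical smoke test against the local Selenium service
target = 'https://stackoverflow.com/questions/tagged/web-scraping'
resp = requests.get('http://127.0.0.1:5000/?url={}'.format(quote(target)))
print(resp.status_code)   # expect 200
print(resp.text[:300])    # start of the Selenium-rendered page source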
This is my Scrapy spider, which makes use of that response to parse the title from each page.
import scrapy
from urllib.parse import quote
from scrapy.crawler import CrawlerProcess

class StackSpider(scrapy.Spider):
    name = 'stackoverflow'
    url = 'https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pageSize=50'
    base = 'https://stackoverflow.com'

    def start_requests(self):
        # Route the listing page through the local Selenium service
        link = 'http://127.0.0.1:5000/?url={}'.format(quote(self.url))
        yield scrapy.Request(link, callback=self.parse)

    def parse(self, response):
        # Collect question links and route each of them through the service as well
        for item in response.css(".summary .question-hyperlink::attr(href)").getall():
            nlink = self.base + item
            link = 'http://127.0.0.1:5000/?url={}'.format(quote(nlink))
            yield scrapy.Request(link, callback=self.parse_info, dont_filter=True)

    def parse_info(self, response):
        item = response.css('h1[itemprop="name"] > a::text').get()
        yield {"title": item}

if __name__ == '__main__':
    c = CrawlerProcess()
    c.crawl(StackSpider)
    c.start()
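For reference, the same entry point can also persist the scraped titles instead of only logging them. A small sketch, assuming Scrapy 2.1+ where the FEEDS setting is available:

from scrapy.crawler import CrawlerProcess

# Hypothetical variant of the __main__ block above: FEEDS writes each
# yielded item to a JSON Lines file instead of only logging it.
c = CrawlerProcess(settings={
    'FEEDS': {'titles.jl': {'format': 'jsonlines'}},
})
c.crawl(StackSpider)
c.start()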
The problem is that the above script gives me the same title multiple times, then another title, and so on.
What change should I make so that my script works the right way?
I ran both scripts, and they run as intended. Here are my findings:
downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError
There is no way to get past this error without the permission of the server, i.e. eBay here.
Logs from Scrapy:
2019-05-25 07:28:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 72,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 64,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8,
'downloader/request_bytes': 55523,
'downloader/request_count': 81,
'downloader/request_method_count/GET': 81,
'downloader/response_bytes': 2448476,
'downloader/response_count': 9,
'downloader/response_status_count/200': 9,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2019, 5, 25, 1, 58, 41, 234183),
'item_scraped_count': 8,
'log_count/DEBUG': 90,
'log_count/INFO': 9,
'request_depth_max': 1,
'response_received_count': 9,
'retry/count': 72,
'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 64,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 8,
'scheduler/dequeued': 81,
'scheduler/dequeued/memory': 81,
'scheduler/enqueued': 131,
'scheduler/enqueued/memory': 131,
'start_time': datetime.datetime(2019, 5, 25, 1, 56, 57, 751009)}
2019-05-25 07:28:41 [scrapy.core.engine] INFO: Spider closed (shutdown)
You can see that only 8 items were scraped; these are just logos and other unrestricted resources.
Server log:
s://*.ebaystatic.com http://*.ebay.com https://*.ebay.com". Either the 'unsafe-inline' keyword, a hash ('sha256-40GZDfucnPVwbvI/Q1ivGUuJtX8krq8jy3tWNrA/n58='), or a nonce ('nonce-...') is required to enable inline execution.
", source: https://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=323815597324&t=0&tid=10&category=169291&seller=wardrobe-ltd&excSoj=1&excTrk=1&lsite=0&ittenable=false&domain=ebay.com&descgauge=1&cspheader=1&oneClk=1&secureDesc=1 (1)
eBay does not allow you to scrape it.
So, how do you complete your task?
Every time, before scraping, check /robots.txt for the site in question (a programmatic check is sketched after the listing below).
For eBay it is: http://www.ebay.com/robots.txt
And you can see almost everything is disallowed.
User-agent: *
Disallow: /*rt=nc
Disallow: /b/LH_
Disallow: /brw/
Disallow: /clp/
Disallow: /clt/store/
Disallow: /csc/
Disallow: /ctg/
Disallow: /ctm/
Disallow: /dsc/
Disallow: /edc/
Disallow: /feed/
Disallow: /gsr/
Disallow: /gwc/
Disallow: /hcp/
Disallow: /itc/
Disallow: /lit/
Disallow: /lst/ng/
Disallow: /lvx/
Disallow: /mbf/
Disallow: /mla/
Disallow: /mlt/
Disallow: /myb/
Disallow: /mys/
Disallow: /prp/
Disallow: /rcm/
Disallow: /sch/%7C
Disallow: /sch/LH_
Disallow: /sch/aop/
Disallow: /sch/ctg/
Disallow: /sl/node
Disallow: /sme/
Disallow: /soc/
Disallow: /talk/
Disallow: /tickets/
Disallow: /today/
Disallow: /trylater/
Disallow: /urw/write-review/
Disallow: /vsp/
Disallow: /ws/
Disallow: /sch/modules=SEARCH_REFINEMENTS_MODEL_V2
Disallow: /b/modules=SEARCH_REFINEMENTS_MODEL_V2
Disallow: /itm/_nkw
Disallow: /itm/?fits
Disallow: /itm/&fits
Disallow: /cta/
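If you want to automate that check, the standard library can parse robots.txt for you. A minimal sketch using urllib.robotparser (the two test URLs are illustrative):

from urllib.robotparser import RobotFileParser

# Parse eBay's robots.txt and test URLs against its Disallow rules
rp = RobotFileParser()
rp.set_url('http://www.ebay.com/robots.txt')
rp.read()

print(rp.can_fetch('*', 'https://www.ebay.com/sch/LH_Auction'))  # False: matches Disallow: /sch/LH_
print(rp.can_fetch('*', 'https://www.ebay.com/itm/123456789'))   # True: plain /itm/ paths are not listed

Scrapy can also enforce this for you: set ROBOTSTXT_OBEY = True in your project settings, and disallowed requests are filtered out before they are sent.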
Therefore, go to https://developer.ebay.com/api-docs/developer/static/developer-landing.html and check their docs; there is simpler example code on their site for getting the items you need without scraping.
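As a rough illustration of what that looks like, here is a hedged sketch of an item search with eBay's Browse API. The endpoint comes from eBay's developer docs; EBAY_OAUTH_TOKEN is a placeholder for a token from your own developer account, and the query parameters are just examples:

import requests

# Placeholder token: obtain a real one from your eBay developer account
EBAY_OAUTH_TOKEN = '<your application token>'

resp = requests.get(
    'https://api.ebay.com/buy/browse/v1/item_summary/search',
    params={'q': 'wardrobe', 'limit': 3},
    headers={'Authorization': 'Bearer {}'.format(EBAY_OAUTH_TOKEN)},
)
for item in resp.json().get('itemSummaries', []):
    print(item.get('title'))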