Unable to make my script process locally created server response in the right way

2024/11/15 0:54:38

I've used a script to run Selenium locally so that I can make use of the response (produced by Selenium) within my Scrapy spider.

This is the web service that runs Selenium locally:

from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    _driver = None

    @staticmethod
    def getDriver():
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            Selenium._driver = webdriver.Chrome(options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        url = str(request.args['url'])
        self.driver.get(url)
        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)
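To sanity-check the service on its own, you can request a rendered page directly. The following is a minimal sketch, assuming the service above is running on http://127.0.0.1:5000 (as the spider below expects) and that the requests library is installed; the target URL is just the one used by the spider:

import requests
from urllib.parse import quote

# ask the local Selenium service to render a page and return its HTML
target = 'https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pageSize=50'
resp = requests.get('http://127.0.0.1:5000/?url={}'.format(quote(target)))
print(resp.status_code)
print(resp.text[:200])  # start of the rendered page source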

This is my Scrapy spider, which makes use of that response to parse the title from each page.

import scrapy
from urllib.parse import quote
from scrapy.crawler import CrawlerProcess

class StackSpider(scrapy.Spider):
    name = 'stackoverflow'
    url = 'https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pageSize=50'
    base = 'https://stackoverflow.com'

    def start_requests(self):
        link = 'http://127.0.0.1:5000/?url={}'.format(quote(self.url))
        yield scrapy.Request(link, callback=self.parse)

    def parse(self, response):
        for item in response.css(".summary .question-hyperlink::attr(href)").getall():
            nlink = self.base + item
            link = 'http://127.0.0.1:5000/?url={}'.format(quote(nlink))
            yield scrapy.Request(link, callback=self.parse_info, dont_filter=True)

    def parse_info(self, response):
        item = response.css('h1[itemprop="name"] > a::text').get()
        yield {"title": item}

if __name__ == '__main__':
    c = CrawlerProcess()
    c.crawl(StackSpider)
    c.start()

The problem is that the above script gives me the same title multiple times, then another title, and so on.

What change should I make so that my script works the right way?

Answer

I ran both scripts, and they run as intended. Here are my findings:

  1. downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError: there is no way to get past this error without the permission of the server, in this case eBay.

  2. Logs from Scrapy:

    2019-05-25 07:28:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 72,
     'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 64,
     'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8,
     'downloader/request_bytes': 55523,
     'downloader/request_count': 81,
     'downloader/request_method_count/GET': 81,
     'downloader/response_bytes': 2448476,
     'downloader/response_count': 9,
     'downloader/response_status_count/200': 9,
     'finish_reason': 'shutdown',
     'finish_time': datetime.datetime(2019, 5, 25, 1, 58, 41, 234183),
     'item_scraped_count': 8,
     'log_count/DEBUG': 90,
     'log_count/INFO': 9,
     'request_depth_max': 1,
     'response_received_count': 9,
     'retry/count': 72,
     'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 64,
     'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 8,
     'scheduler/dequeued': 81,
     'scheduler/dequeued/memory': 81,
     'scheduler/enqueued': 131,
     'scheduler/enqueued/memory': 131,
     'start_time': datetime.datetime(2019, 5, 25, 1, 56, 57, 751009)}
    2019-05-25 07:28:41 [scrapy.core.engine] INFO: Spider closed (shutdown)

You can see that only 8 items were scraped. These are just the logos and other unrestricted things.

  3. Server log:

    s://.ebaystatic.com http://.ebay.com https://*.ebay.com". Either the 'unsafe-inline' keyword, a hash ('sha256-40GZDfucnPVwbvI/Q1ivGUuJtX8krq8jy3tWNrA/n58='), or a nonce ('nonce-...') is required to enable inline execution. ", source: https://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=323815597324&t=0&tid=10&category=169291&seller=wardrobe-ltd&excSoj=1&excTrk=1&lsite=0&ittenable=false&domain=ebay.com&descgauge=1&cspheader=1&oneClk=1&secureDesc=1 (1)

eBay does not allow you to scrape it.

So, how to complete your task:

  1. Every time, before scraping, check /robots.txt for the site in question. For eBay it's http://www.ebay.com/robots.txt, and you can see that almost everything is disallowed (a small robots.txt-checking sketch follows after this list).

    User-agent: *
    Disallow: /*rt=nc
    Disallow: /b/LH_
    Disallow: /brw/
    Disallow: /clp/
    Disallow: /clt/store/
    Disallow: /csc/
    Disallow: /ctg/
    Disallow: /ctm/
    Disallow: /dsc/
    Disallow: /edc/
    Disallow: /feed/
    Disallow: /gsr/
    Disallow: /gwc/
    Disallow: /hcp/
    Disallow: /itc/
    Disallow: /lit/
    Disallow: /lst/ng/
    Disallow: /lvx/
    Disallow: /mbf/
    Disallow: /mla/
    Disallow: /mlt/
    Disallow: /myb/
    Disallow: /mys/
    Disallow: /prp/
    Disallow: /rcm/
    Disallow: /sch/%7C
    Disallow: /sch/LH_
    Disallow: /sch/aop/
    Disallow: /sch/ctg/
    Disallow: /sl/node
    Disallow: /sme/
    Disallow: /soc/
    Disallow: /talk/
    Disallow: /tickets/
    Disallow: /today/
    Disallow: /trylater/
    Disallow: /urw/write-review/
    Disallow: /vsp/
    Disallow: /ws/
    Disallow: /sch/modules=SEARCH_REFINEMENTS_MODEL_V2
    Disallow: /b/modules=SEARCH_REFINEMENTS_MODEL_V2
    Disallow: /itm/_nkw
    Disallow: /itm/?fits
    Disallow: /itm/&fits
    Disallow: /cta/

  2. Therefore, go to https://developer.ebay.com/api-docs/developer/static/developer-landing.html and check their docs; there is easier example code on their site to get the items you need without scraping.
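
For the robots.txt check in step 1, Python's standard library already includes a parser. The following is a minimal sketch using urllib.robotparser, assuming eBay as the target site and using the robots.txt URL above; the example item URLs are made up purely for illustration:

from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url('http://www.ebay.com/robots.txt')
rp.read()

# can_fetch() reports whether a given user agent may request a URL
print(rp.can_fetch('*', 'https://www.ebay.com/sch/LH_CompleteListings'))  # False: /sch/LH_ is disallowed above
print(rp.can_fetch('*', 'https://www.ebay.com/help/home'))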
