Scrapy: how to catch a download error and retry the download

2024/10/5 1:15:22

While crawling, some pages fail because of an unexpected redirect, and no response is returned. How can I catch this kind of error and re-schedule the request with the original URL rather than the redirected one?

Before asking here, I searched Google extensively. There seem to be two ways to handle this: one is to catch the exception in a downloader middleware, the other is to handle the download exception in the errback of the spider's request. I have a question about each method.

  • For method 1, I don't know how to pass the original URL to the process_exception function. Below is the example code I have tried.

    import logging

    logger = logging.getLogger(__name__)


    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            request.meta['proxy'] = "http://192.168.10.10"
            logger.debug('>>>> Proxy %s', request.meta['proxy'])

        def process_exception(self, request, exception, spider):
            # Read the proxy back from meta; a bare `proxy` name is not defined here.
            logger.warning('Failed to request url %s with proxy %s with exception %s',
                           request.url, request.meta.get('proxy', 'nil'), exception)
            # Retry again by returning the request.
            return request

  • For method 2, I don't know how to pass an external parameter to the errback function in the spider, so I don't know how to retrieve the original URL inside the errback in order to re-schedule a request.

    Below is the example I tried with method 2:

    from scrapy import Request, Spider


    class ProxytestSpider(Spider):
        name = "proxytest"
        allowed_domains = ["baidu.com"]
        start_urls = ('http://www.baidu.com/',)

        def make_requests_from_url(self, url):
            request = Request(url, dont_filter=True,
                              callback=self.parse, errback=self.download_errback)
            print("make requests")
            return request

        def parse(self, response):
            print("in parse function")

        def download_errback(self, e):
            # e is the Failure describing the download error.
            print("%s %r" % (type(e), e))
            print(repr(e.value))
            print("in downloaderror_callback")

Any suggestions for this re-crawl issue are highly appreciated. Thanks in advance.

Regards

Bing

Answer

You could pass a lambda as an errback:

    request = Request(url, dont_filter=True, callback=self.parse,
                      errback=lambda x: self.download_errback(x, url))

That way you'll have access to the url inside the errback function:

    def download_errback(self, e, url):
        print(url)
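
As an alternative to the lambda, here is a minimal sketch that assumes a recent Scrapy version, where the Failure passed to an errback exposes the failed request as `failure.request`. The idea is to store the starting URL in the request's `meta` when the request is created and read it back in the errback. The names `retry_with_original_url`, `MAX_RETRIES`, and the meta keys `original_url` and `errback_retries` are hypothetical and chosen only for illustration. The sketch also relies on custom meta keys normally being carried over to redirected requests.

```python
MAX_RETRIES = 3  # assumed cap so a permanently failing URL is not retried forever


def retry_with_original_url(failure):
    """Return a retry request aimed at meta['original_url'], or None to give up.

    `failure` is the object Scrapy passes to an errback; only its
    `.request` attribute is used here.
    """
    request = failure.request

    retries = request.meta.get('errback_retries', 0)
    if retries >= MAX_RETRIES:
        return None

    # Copy meta and bump the (hypothetical) retry counter.
    meta = dict(request.meta, errback_retries=retries + 1)

    # 'original_url' is a hypothetical key set when the request was built, e.g.
    # Request(url, meta={'original_url': url}, errback=self.download_errback).
    original_url = meta.get('original_url', request.url)

    # dont_filter=True so the duplicate filter does not drop the retry.
    return request.replace(url=original_url, meta=meta, dont_filter=True)
```

Inside the spider, the errback could call this helper and yield the result when it is not None. Requests returned from an errback should be scheduled like those returned from a callback, but that is worth confirming against the Scrapy version in use.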
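
For method 1 from the question, the original URL may not need to be passed in at all. Scrapy's RedirectMiddleware records the chain of URLs a request went through in `request.meta['redirect_urls']`, and the first entry is the URL that was originally requested. The following is a sketch under that assumption, not a drop-in replacement for the built-in RetryMiddleware. The class name `RetryOriginalUrlMiddleware`, the `MAX_RETRIES` limit, and the `original_url_retries` meta key are hypothetical.

```python
class RetryOriginalUrlMiddleware(object):
    """Hypothetical downloader middleware: on a download exception,
    re-schedule the request with the URL it had before any redirects."""

    MAX_RETRIES = 2  # assumed limit, not taken from the original post

    # Bookkeeping keys that the redirect handling stores in meta.
    REDIRECT_KEYS = ('redirect_urls', 'redirect_times',
                     'redirect_ttl', 'redirect_reasons')

    def process_exception(self, request, exception, spider):
        # First URL in the redirect chain, or the request's own URL
        # if no redirect happened.
        original_url = (request.meta.get('redirect_urls') or [request.url])[0]

        retries = request.meta.get('original_url_retries', 0)
        if retries >= self.MAX_RETRIES:
            # Give up: returning None lets other middlewares
            # (and finally the errback) handle the exception.
            return None

        # Drop redirect bookkeeping so the retry starts from a clean state.
        meta = {k: v for k, v in request.meta.items()
                if k not in self.REDIRECT_KEYS}
        meta['original_url_retries'] = retries + 1

        # dont_filter=True keeps the duplicate filter from dropping the retry.
        return request.replace(url=original_url, meta=meta, dont_filter=True)
```

The middleware would be enabled through the `DOWNLOADER_MIDDLEWARES` setting. The built-in RetryMiddleware already retries many download errors, but it retries the request it was handed, which after a redirect carries the redirected URL.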