Question 1

During crawling, some pages fail because of an unexpected redirect, and no response is returned. How can I catch this kind of error and re-schedule a request with the original URL, not the redirected URL?

Before asking here, I searched a lot on Google. It looks like there are two ways to handle this: catch the exception in a downloader middleware, or handle the download exception in an errback on the spider's request. I have a question about each approach.

For method 1, I don't know how to pass the original URL to the process_exception function. Below is the example code I have tried.

    import logging

    logger = logging.getLogger(__name__)


    class ProxyMiddleware(object):

        def process_request(self, request, spider):
            request.meta['proxy'] = "http://192.168.10.10"
            logger.debug('>>>> Proxy %s', request.meta.get('proxy', ''))

        def process_exception(self, request, exception, spider):
            proxy = request.meta.get('proxy')
            logger.warning('Failed to request url %s with proxy %s with exception %s',
                           request.url, proxy if proxy else 'nil', exception)
            # retry again.
            return request
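
One way method 1 might recover the original URL, sketched under assumptions: Scrapy's RedirectMiddleware records the URLs a request has passed through in request.meta['redirect_urls'], so the first entry should be the URL the request started from. The class name, the MAX_RETRIES cap and the original_url_retries meta key below are made up for illustration; they are not Scrapy settings.

```python
class RetryOriginalUrlMiddleware:
    """Downloader middleware sketch: on a download exception, retry the URL
    the request started from rather than a redirect target."""

    MAX_RETRIES = 3  # assumed cap, not from the original post

    def process_exception(self, request, exception, spider):
        # Hypothetical meta key used to count retries made by this middleware.
        retries = request.meta.get("original_url_retries", 0)
        if retries >= self.MAX_RETRIES:
            # Returning None lets other middlewares and the errback handle it.
            return None

        # First entry of redirect_urls is the pre-redirect URL; fall back to
        # the current URL when the request was never redirected.
        original_url = request.meta.get("redirect_urls", [request.url])[0]

        # Copy meta so the failed request is left untouched, and clear the
        # redirect bookkeeping so the retry starts with a clean history.
        meta = dict(request.meta)
        meta["original_url_retries"] = retries + 1
        meta.pop("redirect_urls", None)
        meta.pop("redirect_times", None)

        # dont_filter=True keeps the duplicate filter from dropping the retry.
        return request.replace(url=original_url, meta=meta, dont_filter=True)
```

The class would still need to be enabled through the DOWNLOADER_MIDDLEWARES setting, and the cap is there because returning a request unconditionally can loop forever on a page that never loads.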

For method 2, I don't know how to pass an extra parameter to the errback function in the spider, so I can't retrieve the original URL inside the errback to re-schedule a request.
Below is the example I tried with method 2:

    from scrapy import Request, Spider


    class ProxytestSpider(Spider):
        name = "proxytest"
        allowed_domains = ["baidu.com"]
        start_urls = ('http://www.baidu.com/',)

        def make_requests_from_url(self, url):
            request = Request(url, dont_filter=True,
                              callback=self.parse, errback=self.download_errback)
            print("make requests")
            return request

        def parse(self, response):
            print("in parse function")

        def download_errback(self, e):
            # e is a Twisted Failure; e.value is the underlying exception
            print(type(e), repr(e))
            print(repr(e.value))
            print("in downloaderror_callback")

Any suggestion for this recrawl issue is highly appreciated. Thanks in advance.

Regards

Bing

Answer

You could pass a lambda as an errback:

    request = Request(url, dont_filter=True, callback=self.parse,
                      errback=lambda x: self.download_errback(x, url))

that way you'll have access to the url inside the errback function:

    def download_errback(self, e, url):
        print(url)
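
One caveat with the lambda, shown as a small sketch rather than real Scrapy code: if requests are built in a loop, every lambda looks up url only when it is called, so they can all end up seeing the last URL. functools.partial stores the value when the errback is created. FakeRequest and DemoSpider below are hypothetical stand-ins so the example runs without Scrapy installed.

```python
from functools import partial


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: just a URL and an errback."""

    def __init__(self, url, errback):
        self.url = url
        self.errback = errback


class DemoSpider:
    """Hypothetical stand-in for a spider; records the URLs its errback saw."""

    def __init__(self):
        self.failed_urls = []

    def download_errback(self, failure, url):
        self.failed_urls.append(url)

    def requests_with_lambda(self, urls):
        requests = []
        for url in urls:
            # Late binding: `url` is looked up when the lambda is called,
            # so after the loop every lambda sees the last URL.
            requests.append(
                FakeRequest(url, lambda f: self.download_errback(f, url))
            )
        return requests

    def requests_with_partial(self, urls):
        requests = []
        for url in urls:
            # partial captures the current value of `url` right away.
            requests.append(
                FakeRequest(url, partial(self.download_errback, url=url))
            )
        return requests
```

In the snippet above the lambda is created inside make_requests_from_url, where url is a function argument, so each lambda already gets its own value; the difference only shows up when requests are built in a loop. Recent Scrapy versions also document a failure.request attribute on the failure passed to an errback, so failure.request.url may be enough without any extra argument; that is worth checking against the version in use.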

Scrapy: how to catch download error and try download it again
