Is it possible to delay the retry of a particular Scrapy Request? I have a middleware that needs to defer the request of a page until a later time. I know how to do the basic deferral (sending the request to the end of the queue) and how to delay all requests (global settings), but I want to delay just this one individual request. This matters most near the end of the queue, where a simple deferral makes the request immediately become the next one again.
Method 1
One way would be to add a downloader middleware to your project (source):
# File: middlewares.py

from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):

    def process_request(self, request, spider):
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return

        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
Which you could later use in your Spider like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will have itself delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here
Method 2
You could do this with a custom retry middleware (source); you just need to override the process_response method of the built-in RetryMiddleware:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Your delay code here, for example sleep(10)
            # or polling server until it is alive
            return self._retry(request, reason, spider) or response
        return response
Then enable it instead of the default RetryMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
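One caveat with the "your delay code here" placeholder: Scrapy runs on a single-threaded Twisted reactor, so a blocking sleep() there pauses every request, not just the one being retried. A possible way around this is to combine the two methods: have the retry middleware tag only the retried request with the delay_request_by meta key from Method 1, and let DelayedRequestsMiddleware do the waiting. The sketch below shows one way to compute that value with exponential backoff. The function names and the base_delay / max_delay parameters are hypothetical, not part of Scrapy; the only Scrapy detail it relies on is that RetryMiddleware stores the retry count in meta under 'retry_times'.

```python
def retry_delay_seconds(meta, base_delay=2.0, max_delay=60.0):
    """Return a delay in seconds that grows with the retry count.

    `meta` is a request's meta dict. Scrapy's RetryMiddleware records
    the number of retries so far under 'retry_times'. `base_delay` and
    `max_delay` are hypothetical tuning knobs, not Scrapy settings.
    """
    retries = int(meta.get("retry_times", 0))
    # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
    exponent = max(retries - 1, 0)
    return min(max_delay, base_delay * (2 ** exponent))


def tag_with_delay(meta, **kwargs):
    """Return a copy of `meta` with 'delay_request_by' set (Method 1 key)."""
    tagged = dict(meta)
    tagged["delay_request_by"] = retry_delay_seconds(meta, **kwargs)
    return tagged


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, retry_delay_seconds({"retry_times": n}))
```

In CustomRetryMiddleware.process_response, this would presumably be applied to the request returned by self._retry(request, reason, spider), for example by setting retry_req.meta['delay_request_by'] = retry_delay_seconds(retry_req.meta) before returning it. Note that _retry returns None once the retry limit is reached, so the value needs a None check first, and DelayedRequestsMiddleware from Method 1 must be enabled for the key to have any effect.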