I am trying to limit the number of crawled pages per URL in a CrawlSpider in Scrapy. I have a list of start_urls, and I want to set a limit on the number of pages crawled for each URL. Once the limit is reached, the spider should move on to the next start_url.
I know there is the DEPTH_LIMIT setting, but this is not what I am looking for.
Any help will be useful.
Here is the code I currently have:
    class MySpider(CrawlSpider):
        name = 'test'
        allowed_domains = domainvarwebsite
        start_urls = httpvarwebsite

        rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

        def parse_item(self, response):
            # here I parse and yield the items I am interested in.
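At its core, the per-start_url cap described above is just bookkeeping: count how many pages have been fetched per site and refuse further ones once a threshold is hit. A minimal, framework-free sketch of that counting logic (the helper name `should_crawl` and the limit value are assumptions for illustration, not part of the original code):

```python
from collections import defaultdict
from urllib.parse import urlparse

def should_crawl(counter, url, limit):
    """Return True (and count the page) while the url's site is under its limit."""
    site = urlparse(url).netloc
    if counter[site] < limit:
        counter[site] += 1
        return True
    return False

# Example: with a limit of 2 pages per site, the third request is refused.
counter = defaultdict(int)
print(should_crawl(counter, 'http://example.com/a', 2))  # True
print(should_crawl(counter, 'http://example.com/b', 2))  # True
print(should_crawl(counter, 'http://example.com/c', 2))  # False
```

A Scrapy downloader middleware (as attempted in the EDIT below) is a natural place to plug this check in, since it sees every outgoing request.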
EDIT
I have tried to implement this, but I get exceptions.SyntaxError: invalid syntax (filter_domain.py, line 20). Any ideas of what is going on?
Thanks again.
filter_domain.py
import urlparse
from collections import defaultdict
    from scrapy.exceptions import IgnoreRequest


    class FilterDomainbyLimitMiddleware(object):
        def __init__(self, domains_to_filter):
            self.domains_to_filter = domains_to_filter
            self.counter = defaultdict(int)

        @classmethod
        def from_crawler(cls, crawler):
            settings = crawler.settings
            spider_name = crawler.spider.name
            max_to_filter = settings.get('MAX_TO_FILTER')
            o = cls(max_to_filter)
            return o

        def process_request(self, request, spider):
            parsed_url = urlparse.urlparse(request.url)
            # LINE 20:
            if self.counter.get(parsed_url.netloc, 0) < self.max_to_filter[parsed_url.netloc]):
                self.counter[parsed_url.netloc] += 1
            else:
                raise IgnoreRequest()
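For reference, here is how the middleware could look once the stray closing parenthesis on line 20 is removed and the constructor argument is stored under the same name it is later read from (`max_to_filter`). This is a hedged sketch, not a verified drop-in fix: `IgnoreRequest` is stubbed locally so the snippet runs on its own (in a real project it comes from `scrapy.exceptions`), `urlparse` is taken from Python 3's `urllib.parse`, `from_crawler` is omitted, and `MAX_TO_FILTER` is assumed to be a single integer limit applied to every domain rather than a per-domain dict:

```python
from collections import defaultdict
from urllib.parse import urlparse


class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest so the sketch is self-contained."""


class FilterDomainbyLimitMiddleware(object):
    def __init__(self, max_to_filter):
        self.max_to_filter = max_to_filter  # stored under the name it is read from
        self.counter = defaultdict(int)

    def process_request(self, request, spider):
        netloc = urlparse(request.url).netloc
        if self.counter[netloc] < self.max_to_filter:  # no stray ')' here
            self.counter[netloc] += 1
        else:
            raise IgnoreRequest()


# Tiny stand-in request object, just enough to exercise the middleware.
class FakeRequest(object):
    def __init__(self, url):
        self.url = url


mw = FilterDomainbyLimitMiddleware(max_to_filter=2)
mw.process_request(FakeRequest('http://example.com/1'), spider=None)  # allowed
mw.process_request(FakeRequest('http://example.com/2'), spider=None)  # allowed
try:
    mw.process_request(FakeRequest('http://example.com/3'), spider=None)
except IgnoreRequest:
    print('third request to example.com ignored')
```

If a different limit per domain is wanted, `MAX_TO_FILTER` could instead be a dict keyed by netloc, which is what the original `self.max_to_filter[parsed_url.netloc]` lookup seems to expect.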
settings.py
    MAX_TO_FILTER = 30

    DOWNLOADER_MIDDLEWARES = {
        'myproject.filter_domain.FilterDomainbyLimitMiddleware': 400,
    }