I am trying to build a spider that efficiently scrapes text information from many websites. Since I am a Python user, I was referred to Scrapy. To avoid scraping huge websites in their entirety, I want to limit the spider to no more than 20 pages of a certain "depth" per website. Here is my spider:
    from urlparse import urlparse

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class DownloadSpider(CrawlSpider):
        name = 'downloader'
        # Trailing slash so the per-site files land inside the directory.
        download_path = '/home/MyProjects/crawler/'
        rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

        def __init__(self, *args, **kwargs):
            super(DownloadSpider, self).__init__(*args, **kwargs)
            self.urls_file_path = kwargs.get('urls_file')
            with open(self.urls_file_path) as f:
                data = f.readlines()
            self.allowed_domains = [urlparse(line.strip()).hostname for line in data]
            self.start_urls = ['http://' + domain for domain in self.allowed_domains]

        def parse_start_url(self, response):
            return self.parse_item(response)

        def parse_item(self, response):
            # Append each visited URL to a file named after the site's hostname.
            fname = self.download_path + urlparse(response.url).hostname + '.txt'
            with open(fname, 'a') as f:
                f.write(response.url + '\n')
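For context, I start the crawl with something like the following; -a is how Scrapy passes keyword arguments such as urls_file into the spider's __init__ (the file path is just an example):

    scrapy crawl downloader -a urls_file=/home/MyProjects/urls.txt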
urls_file is the path to a text file with URLs, one per line. I have also set the max depth (DEPTH_LIMIT) in the settings file. Here is my problem: if I set the CLOSESPIDER_PAGECOUNT setting, it closes the spider when the total number of scraped pages (regardless of which site they came from) reaches that value. However, I need to stop scraping a site once I have scraped, say, 20 pages from it.
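For reference, the relevant lines in my settings.py look something like this (the exact numbers are placeholders):

    # settings.py
    DEPTH_LIMIT = 2             # maximum link depth to follow
    CLOSESPIDER_PAGECOUNT = 20  # closes the spider after 20 pages in total, across all sites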
I also tried keeping count with an instance variable, incrementing self.parsed_number += 1 in the callback, but this didn't work either: Scrapy doesn't crawl one start URL at a time, it interleaves requests from all of the sites, so a single counter can't tell me how many pages came from each one.
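For illustration, here is roughly what that attempt looks like, rewritten with one counter per hostname instead of a single global variable (the pages_per_site name and initializing it in __init__ are my own; note that this only stops recording pages past the limit, it doesn't stop Scrapy from fetching them):

    from collections import defaultdict
    from urlparse import urlparse

    # inside DownloadSpider; assumes __init__ also does:
    #     self.pages_per_site = defaultdict(int)

    def parse_item(self, response):
        host = urlparse(response.url).hostname
        self.pages_per_site[host] += 1        # count pages per site, not globally
        if self.pages_per_site[host] > 20:
            return                            # skip this page, but the crawl goes on
        fname = self.download_path + host + '.txt'
        with open(fname, 'a') as f:
            f.write(response.url + '\n')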
Any advice is much appreciated!