How to limit number of followed pages per site in Python Scrapy


I am trying to build a spider that could efficiently scrape text information from many websites. Since I am a Python user I was referred to Scrapy. However, in order to avoid scraping huge websites, I want to limit the spider to scrape no more than 20 pages of a certain "depth" per website. Here is my spider:

from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    download_path = '/home/MyProjects/crawler'
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(DownloadSpider, self).__init__(*args, **kwargs)
        # Read the list of start URLs from the file passed as the urls_file argument
        self.urls_file_path = [kwargs.get('urls_file')]
        data = open(self.urls_file_path[0], 'r').readlines()
        self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        # Append the visited URL to a per-domain text file
        self.fname = self.download_path + urlparse(response.url).hostname.strip()
        open(str(self.fname) + '.txt', 'a').write(response.url)
        open(str(self.fname) + '.txt', 'a').write('\n')

urls_file is the path to a text file with URLs. I have also set the max depth in the settings file. Here is my problem: if I set CLOSESPIDER_PAGECOUNT, it closes the spider when the total number of scraped pages (regardless of which site) reaches that value. However, I need to stop scraping when I have scraped, say, 20 pages from each site. I also tried keeping count with a variable like self.parsed_number += 1, but this didn't work either -- it seems that Scrapy doesn't go URL by URL but mixes them up. Any advice is much appreciated!
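For reference, the kind of per-domain bookkeeping I tried looks roughly like this (illustrative sketch only -- page_counts and the 20-page cap are my own names, and this only counts callbacks, it doesn't stop Scrapy from following links):

from collections import defaultdict
from urlparse import urlparse

# Inside DownloadSpider -- count parsed pages per hostname instead of one global counter
def parse_item(self, response):
    host = urlparse(response.url).hostname.strip()
    if not hasattr(self, 'page_counts'):
        self.page_counts = defaultdict(int)
    self.page_counts[host] += 1
    if self.page_counts[host] > 20:
        return  # ignore pages beyond the per-site limit (links are still followed, though)
    fname = self.download_path + host
    open(fname + '.txt', 'a').write(response.url + '\n')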

Answer

To do this you can create your own link extractor class based on SgmlLinkExtractor. It should look something like this:

from scrapy.selector import Selector
from scrapy.utils.response import get_base_url
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class LimitedLinkExtractor(SgmlLinkExtractor):
    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None,
                 deny_extensions=None, max_pages=20):
        # Remember the per-page link limit, then delegate everything else to the base class
        self.max_pages = max_pages
        SgmlLinkExtractor.__init__(self, allow=allow, deny=deny, allow_domains=allow_domains,
                                   deny_domains=deny_domains, restrict_xpaths=restrict_xpaths,
                                   tags=tags, attrs=attrs, canonicalize=canonicalize, unique=unique,
                                   process_value=process_value, deny_extensions=deny_extensions)

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            sel = Selector(response)
            base_url = get_base_url(response)
            body = u''.join(f
                            for x in self.restrict_xpaths
                            for f in sel.xpath(x).extract()
                            ).encode(response.encoding, errors='xmlcharrefreplace')
        else:
            body = response.body

        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        # Keep only the first max_pages links extracted from each page
        links = links[:self.max_pages]
        return links

The code of this subclass is based entirely on the code of SgmlLinkExtractor. I've just added the self.max_pages variable to the class constructor and a line that truncates the list of links at the end of the extract_links method. But you can cut this list in a more intelligent way.
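For example, you could plug it into the rules of the spider from the question roughly like this (untested sketch; max_pages=20 is just the limit mentioned above):

from scrapy.contrib.spiders import CrawlSpider, Rule

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    # Each page now contributes at most 20 new links to the crawl
    rules = (Rule(LimitedLinkExtractor(max_pages=20), callback='parse_item', follow=True),)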
