How to limit number of followed pages per site in Python Scrapy


I am trying to build a spider that could efficiently scrape text information from many websites. Since I am a Python user I was referred to Scrapy. However, in order to avoid scraping huge websites, I want to limit the spider to scrape no more than 20 pages of a certain "depth" per website. Here is my spider:

from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    download_path = '/home/MyProjects/crawler'
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(DownloadSpider, self).__init__(*args, **kwargs)
        # Read the list of start URLs from the file passed as the urls_file argument
        self.urls_file_path = [kwargs.get('urls_file')]
        data = open(self.urls_file_path[0], 'r').readlines()
        self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        # Append the visited URL to a per-domain text file
        self.fname = self.download_path + urlparse(response.url).hostname.strip()
        open(str(self.fname) + '.txt', 'a').write(response.url)
        open(str(self.fname) + '.txt', 'a').write('\n')

urls_file is the path to a text file with URLs. I have also set the max depth in the settings file. Here is my problem: if I set CLOSESPIDER_PAGECOUNT, it closes the spider when the total number of scraped pages (regardless of which site) reaches that value. However, I need to stop scraping when I have scraped, say, 20 pages from each site. I also tried keeping count with a variable like self.parsed_number += 1, but this didn't work either -- it seems that Scrapy doesn't go URL by URL but mixes them up. Any advice is much appreciated!
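For reference, the kind of per-domain bookkeeping I tried looks roughly like this (illustrative sketch only -- page_counts and the 20-page cap are my own names, and this only counts callbacks, it doesn't stop Scrapy from following links):

from collections import defaultdict
from urlparse import urlparse

# Inside DownloadSpider -- count parsed pages per hostname instead of one global counter
def parse_item(self, response):
    host = urlparse(response.url).hostname.strip()
    if not hasattr(self, 'page_counts'):
        self.page_counts = defaultdict(int)
    self.page_counts[host] += 1
    if self.page_counts[host] > 20:
        return  # ignore pages beyond the per-site limit (links are still followed, though)
    fname = self.download_path + host
    open(fname + '.txt', 'a').write(response.url + '\n')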

Answer

To do this you can create your own link extractor class based on SgmlLinkExtractor. It should look something like this:

from scrapy.selector import Selector
from scrapy.utils.response import get_base_url
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class LimitedLinkExtractor(SgmlLinkExtractor):
    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None,
                 deny_extensions=None, max_pages=20):
        # Remember the per-page link limit, then delegate everything else to the base class
        self.max_pages = max_pages
        SgmlLinkExtractor.__init__(self, allow=allow, deny=deny, allow_domains=allow_domains,
                                   deny_domains=deny_domains, restrict_xpaths=restrict_xpaths,
                                   tags=tags, attrs=attrs, canonicalize=canonicalize, unique=unique,
                                   process_value=process_value, deny_extensions=deny_extensions)

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            sel = Selector(response)
            base_url = get_base_url(response)
            body = u''.join(f
                            for x in self.restrict_xpaths
                            for f in sel.xpath(x).extract()
                            ).encode(response.encoding, errors='xmlcharrefreplace')
        else:
            body = response.body

        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        # Keep only the first max_pages links extracted from each page
        links = links[:self.max_pages]
        return links

The code of this subclass is based entirely on the code of SgmlLinkExtractor. I've just added the self.max_pages variable to the class constructor and a line that truncates the list of links at the end of the extract_links method. But you can cut this list in a more intelligent way.
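For example, you could plug it into the rules of the spider from the question roughly like this (untested sketch; max_pages=20 is just the limit mentioned above):

from scrapy.contrib.spiders import CrawlSpider, Rule

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    # Each page now contributes at most 20 new links to the crawl
    rules = (Rule(LimitedLinkExtractor(max_pages=20), callback='parse_item', follow=True),)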
