Scrapy LinkExtractor - Limit the number of pages crawled per URL

2024/10/9 16:32:48

I am trying to limit the number of crawled pages per URL in a CrawlSpider in Scrapy. I have a list of start_urls and I want to set a limit on the number of pages crawled for each URL. Once the limit is reached, the spider should move on to the next start_url.

I know there is the DEPTH_LIMIT setting, but that is not what I am looking for.

Any help will be useful.

Here is the code I currently have:

class MySpider(CrawlSpider):
    name = 'test'
    allowed_domains = domainvarwebsite
    start_urls = httpvarwebsite
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        # here I parse and yield the items I am interested in.

EDIT

I have tried to implement this, but I get exceptions.SyntaxError: invalid syntax (filter_domain.py, line 20). Any ideas of what is going on?

thanks again.

filter_domain.py

import urlparse
from collections import defaultdict
from scrapy.exceptions import IgnoreRequest


class FilterDomainbyLimitMiddleware(object):
    def __init__(self, domains_to_filter):
        self.domains_to_filter = domains_to_filter
        self.counter = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        spider_name = crawler.spider.name
        max_to_filter = settings.get('MAX_TO_FILTER')
        o = cls(max_to_filter)
        return o

    def process_request(self, request, spider):
        parsed_url = urlparse.urlparse(request.url)
        # this is line 20:
        if self.counter.get(parsed_url.netloc, 0) < self.max_to_filter[parsed_url.netloc]):
            self.counter[parsed_url.netloc] += 1
        else:
            raise IgnoreRequest()

settings.py

MAX_TO_FILTER = 30

DOWNLOADER_MIDDLEWARES = {
    'myproject.filter_domain.FilterDomainbyLimitMiddleware': 400,
}
Answer

Scrapy doesn't offer this directly, but you could create a custom Middleware, something like this:

import urlparse
from collections import defaultdict
from scrapy.exceptions import IgnoreRequest


class FilterDomainbyLimitMiddleware(object):
    def __init__(self, domains_to_filter):
        self.domains_to_filter = domains_to_filter
        self.counter = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        domains_to_filter = settings.get('DOMAINS_TO_FILTER')
        return cls(domains_to_filter)

    def process_request(self, request, spider):
        parsed_url = urlparse.urlparse(request.url)
        if parsed_url.netloc in self.domains_to_filter:
            # allow the request while the domain is under its limit,
            # otherwise drop it
            if self.counter.get(parsed_url.netloc, 0) < self.domains_to_filter[parsed_url.netloc]:
                self.counter[parsed_url.netloc] += 1
            else:
                raise IgnoreRequest()

and declaring the DOMAINS_TO_FILTER in settings like:

DOMAINS_TO_FILTER = {
    'mydomain': 5,
}

to only accept 5 requests from that domain. Also remember to enable the middleware under DOWNLOADER_MIDDLEWARES in your settings, as specified in the Scrapy documentation.
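The per-domain counting at the core of this middleware can be sketched in isolation. The snippet below is a minimal standalone version of that logic (using Python 3's urllib.parse rather than the Python 2 urlparse module above; the domain name and limit are made up for illustration), not the Scrapy-integrated middleware itself:

```python
from collections import defaultdict
from urllib.parse import urlparse

# hypothetical configuration: allow at most 5 requests to example.com
DOMAINS_TO_FILTER = {'example.com': 5}
counter = defaultdict(int)

def should_crawl(url):
    """Return True while the URL's domain is under its limit, False after."""
    netloc = urlparse(url).netloc
    limit = DOMAINS_TO_FILTER.get(netloc)
    if limit is None:
        return True  # no limit configured for this domain
    if counter[netloc] < limit:
        counter[netloc] += 1
        return True
    return False  # the middleware would raise IgnoreRequest here

# the first 5 requests are allowed, the rest are refused
results = [should_crawl('http://example.com/page') for _ in range(7)]
```

In the real middleware, returning False corresponds to raising IgnoreRequest, which makes Scrapy silently drop the request so the spider naturally drains the remaining start_urls.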

