How to stop scrapy spider after certain number of requests?

2024/10/3 12:27:19

I am developing a simple scraper to get 9GAG posts and their images, but I am unable to stop the scraper: it keeps on scraping, which I don't want. I want to increment a counter and stop after 100 posts. However, the 9GAG page is designed so that each response returns only 10 posts, and after each iteration my counter starts over, so it never gets past 10. As a result my loop runs indefinitely and never stops.

# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = [""]
    start_urls = ('',)
    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if (count != 100):
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
                    yield ninegag_item
                else:
                    break

        next_url = '' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count

The item definition is here:

from scrapy.item import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()

So I want to increment a global count value. I tried this by passing 3 arguments to the parse function, but it gives an error:

TypeError: parse() takes exactly 3 arguments (2 given)

So is there a way to pass a global count value, return it after each iteration, and stop after (say) 100 posts?

The entire project is available on GitHub. Even if I set POST_LIMIT=100, the infinite loop still happens. Here is the command I executed:

scrapy crawl first -s POST_LIMIT=10 --output=output.json

There's a built-in setting, CLOSESPIDER_PAGECOUNT, that can be passed via the command-line -s argument or changed in the settings:

scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100
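If you would rather not pass the flag on every run, the same value can live in the project's settings module. A minimal sketch, assuming the usual settings.py layout of a Scrapy project; CLOSESPIDER_ITEMCOUNT is the item-based counterpart from the same CloseSpider extension, and is probably closer to "stop after 100 posts" than a page count. Both are soft limits: requests already in flight are still processed, so a few extra items can come through.

```python
# settings.py (assumed location: the Scrapy project's settings module)

# Close the spider once this many responses have been crawled.
CLOSESPIDER_PAGECOUNT = 100

# Alternatively, close the spider once this many items have been scraped.
# With 10 posts per response, 100 items is roughly 10 pages.
CLOSESPIDER_ITEMCOUNT = 100
```

Setting only one of the two is enough; whichever limit is hit first closes the spider.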

One small caveat: if you've enabled caching, CLOSESPIDER_PAGECOUNT counts cache hits as pages as well.
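As for why the counter in the question never gets past 10: count is a local variable of parse(), so it is set back to 0 every time Scrapy calls the callback for a new response. Keeping the counter on the spider instance avoids that, and needs no extra arguments to parse(). Below is a rough, Scrapy-free sketch of just that counting logic. CountingSpiderSketch, fake_pages, crawl and post_limit are hypothetical names made up for illustration, not Scrapy APIs.

```python
def fake_pages(per_page=10):
    """Endless source of pages with 10 posts each, like the site in the question."""
    page = 0
    while True:
        yield [page * per_page + i for i in range(per_page)]
        page += 1


class CountingSpiderSketch:
    """Hypothetical stand-in for a spider; only the counting logic matters."""

    post_limit = 100  # assumed limit, matching the 100 posts in the question

    def __init__(self):
        # Instance attribute: survives across parse() calls, unlike a local.
        self.count = 0

    def limit_reached(self):
        return self.count >= self.post_limit

    def parse(self, posts):
        """Yield posts from one page until the overall limit is hit."""
        for post in posts:
            if self.limit_reached():
                return
            self.count += 1
            yield post


def crawl(spider, pages):
    """Stand-in for the engine: feed pages to parse() until the limit is hit."""
    items = []
    for posts in pages:
        items.extend(spider.parse(posts))
        if spider.limit_reached():
            break  # in a real spider: do not yield the next Request
    return items


spider = CountingSpiderSketch()
items = crawl(spider, fake_pages())
print(len(items), spider.count)
```

In a real spider the same idea would be a self.count attribute that parse() increments, with the next scrapy.Request simply not yielded once the limit is reached (or scrapy.exceptions.CloseSpider raised from the callback).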

