How to stop a Scrapy spider after a certain number of requests?

2024/10/3 12:27:19

I am developing a simple scraper to get 9GAG posts and their images, but I am unable to stop it: it keeps on scraping, which I don't want. I want to increment a counter and stop after 100 posts. However, the 9GAG page is designed so that each response contains only 10 posts, and on each iteration my counter starts over and only reaches 10, so the loop runs forever and never stops.


# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/',)
    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count +=1
            if gag_id:
                if (count != 100):
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
                    yield ninegag_item
                else:
                    break
        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print count

Code for items.py is here

from scrapy.item import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()

So I want to increment a global count value. I tried this by passing 3 arguments to the parse function, but it gives this error:

TypeError: parse() takes exactly 3 arguments (2 given)

So is there a way to keep a global count value, update it after each iteration, and stop after, say, 100 posts?

The entire project is available on GitHub. Even if I set POST_LIMIT=100, the infinite loop still happens. Here is the command I executed:

scrapy crawl first -s POST_LIMIT=10 --output=output.json
Answer

There's a built-in setting, CLOSESPIDER_PAGECOUNT, that can be passed via the command-line -s argument or changed in the settings:

scrapy crawl <spider> -s CLOSESPIDER_PAGECOUNT=100
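The same limit can also be set in the project's settings file instead of on the command line. The fragment below is a sketch of that; the numeric values are assumptions chosen to match the question (about 10 posts per response, 100 posts wanted). CLOSESPIDER_ITEMCOUNT is a related built-in setting that counts scraped items rather than responses, which may be closer to "stop after 100 posts".

```python
# settings.py (fragment); the numeric values are illustrative assumptions
CLOSESPIDER_PAGECOUNT = 10   # close the spider after 10 responses (roughly 100 posts at 10 per page)
CLOSESPIDER_ITEMCOUNT = 100  # or close it once 100 items have passed the item pipeline
```

If both are set, the spider closes on whichever limit is hit first. Requests already in flight may still be processed, so the final count can end up slightly above the limit.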

One small caveat: if you've enabled caching, cache hits count toward the page count as well.
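As for the counter in the question: count is a local variable inside parse, so it starts from 0 on every response. Storing it on the spider instance keeps it alive between callbacks. The sketch below shows only that pattern, without Scrapy, so it runs on its own; PostCounter, fake_pages and crawl are hypothetical names, not part of Scrapy or the original project. In a real spider the early return would typically be replaced by raising scrapy.exceptions.CloseSpider.

```python
# Framework-free sketch of a counter that persists across callbacks.
# PostCounter, fake_pages and crawl are hypothetical stand-ins, not Scrapy APIs.

POSTS_PER_PAGE = 10  # the question says each response carries 10 posts


class PostCounter:
    """Mimics a spider whose parse() callback runs once per response."""

    def __init__(self, post_limit):
        self.post_limit = post_limit
        self.count = 0  # stored on the instance, so it survives between callbacks

    def parse(self, page):
        """Yield posts from one page and stop once the limit is reached."""
        for post in page:
            if self.count >= self.post_limit:
                return  # in a real spider: raise scrapy.exceptions.CloseSpider(...)
            self.count += 1
            yield post


def fake_pages():
    """Endless stream of pages, standing in for 9gag's pagination."""
    page_number = 0
    while True:
        start = page_number * POSTS_PER_PAGE
        yield [f"post-{i}" for i in range(start, start + POSTS_PER_PAGE)]
        page_number += 1


def crawl(post_limit):
    """Feed pages to the counter until the limit is reached."""
    counter = PostCounter(post_limit)
    collected = []
    for page in fake_pages():
        collected.extend(counter.parse(page))
        if counter.count >= post_limit:
            break  # stop requesting further pages
    return collected


posts = crawl(post_limit=100)
print(len(posts))
```

The point is that self.count is shared by every call to parse, whereas a local count = 0 is recreated each time and can never get past the number of posts in a single response.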
