Scrapy: AttributeError: YourCrawler object has no attribute parse_following_urls


I am writing a Scrapy spider. I have been reading this question: Scrapy: scraping a list of links, and I can make it recognise the urls on a list page, but I can't make it go inside the urls and save the data I want to see.

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
        'https://example.com/materias/?novedades=LC&p',
    ]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('///*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(url, callback=self.parse_following_urls, dont_filter=True)

        # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
        def parse_following_urls(self, response):
            # Parsing rules go here
            for each_book in response.css('div#main'):
                yield {
                    'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
                }

        # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

It gives an error:

AttributeError: 'YourCrawler' object has no attribute 'parse_following_urls'

And I don't understand why!

EDIT --

As the answer says, I had to close the parse method and fix the indentation, like here:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Requestclass YourCrawler(CrawlSpider):name = "bookstore_2"start_urls = ['https://example.com/materias/?novedades=LC&p',]def parse(self, response):# go to the urls in the lists = Selector(response)page_list_urls = s.xpath('///*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()for url in page_list_urls:yield Request(url, callback=self.parse_following_urls, dont_filter=True)# For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > lidef parse_following_urls(self, response):#Parsing rules go herefor each_book in response.css('div#main'):yield {'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),}# Return back and go to bext page in div#paginat ul li.next a::attr(href) and begin againnext_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()if next_page is not None:next_page = response.urljoin(next_page)yield scrapy.Request(next_page, callback=self.parse)

But there is another problem, I think related to the urls, and now I am getting this traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/nikita/scrapy/bookstore_2/bookstore_2/spiders/bookstore_2.py", line 16, in parse
    yield Request(url, callback=self.parse_following_urls, dont_filter=True)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /book/?id=9780374281083

Maybe it's because I have to tell Scrapy what the base url is? Should I add a urljoin somewhere?

EDIT_2 ---

OK, the problem was with the urls. Adding

response.urljoin()

solved this issue.
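
For reference, this is roughly how the parse method looks now (a sketch that keeps my original selector, with only the urljoin added):

def parse(self, response):
    # go to the urls in the list
    s = Selector(response)
    page_list_urls = s.xpath('///*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
    for url in page_list_urls:
        # response.urljoin() resolves the relative href '/book/?id=...' against the page url
        yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)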

Answer

In your code,

  yield Request(url, callback=self.parse_following_urls, dont_filter=True)

you used parse_following_urls with self.
But parse_following_urls is defined inside the parse function, so it isn't a method of YourCrawler.
That's why the error says
AttributeError: 'YourCrawler' object has no attribute 'parse_following_urls'
You should define it like:

class YourCrawler(CrawlSpider):
    def parse_following_urls(self, response):
        ...

to make it a method of the class.
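
To see why the nested version fails, here is a minimal standalone sketch (the Demo class is hypothetical, purely to illustrate the attribute lookup):

class Demo:
    def outer(self):
        # inner is just a local variable of outer, created while outer runs;
        # it never becomes an attribute of Demo or of its instances
        def inner(self):
            pass

d = Demo()
d.outer  # fine: outer is a method of the class
d.inner  # raises AttributeError: 'Demo' object has no attribute 'inner'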

Edit

For the additional question:

In your code, s.xpath('///*[@id="results"]/ul/li/div[1]/h4/a[2]/@href') selects the href attribute of the a tag on the html page you want to scrape.
However, that href is only '/book/?id=9780374281083', not the full url.
So you should turn it into https://lacentral.com/book/?id=9780374281083 before using it.
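
The convenient way to do that is response.urljoin(), which resolves a relative href against the url of the response it came from. It behaves essentially like the standard library's urljoin (a small sketch, independent of the spider above):

from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

# response.urljoin(href) is essentially urljoin(response.url, href)
print(urljoin('https://lacentral.com/materias/?novedades=LC&p',
              '/book/?id=9780374281083'))
# prints: https://lacentral.com/book/?id=9780374281083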
