I am writing a Scrapy spider. I have been reading this question: Scrapy: scraping a list of links, and I can make it recognise the urls on a list page, but I can't make it go inside those urls and save the data I want.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
        'https://example.com/materias/?novedades=LC&p',
    ]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(url, callback=self.parse_following_urls, dont_filter=True)

        # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
        def parse_following_urls(self, response):
            # Parsing rules go here
            for each_book in response.css('div#main'):
                yield {
                    'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
                }
            # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
            next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield Request(next_page, callback=self.parse)
It gives an error:
AttributeError: 'YourCrawler' object has no attribute 'parse_following_urls'
And I don't understand why!
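(As it turns out, the error comes from the indentation: `def parse_following_urls` is nested inside `parse`, so it is only a local function, not a method of the class. A minimal sketch, with a made-up `Demo` class unrelated to Scrapy, shows the same effect:)

```python
# A def nested inside a method only exists as a local name while the
# outer method runs; it never becomes an attribute of the class.
class Demo:
    def outer(self):
        def inner(self):  # local function, NOT a method of Demo
            return "inner"
        return "outer"

d = Demo()
print(d.outer())            # "outer" - works fine
print(hasattr(d, "inner"))  # False - d.inner raises AttributeError
```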
EDIT --
As the answer says, I had to fix the indentation so that the method is closed properly and defined at class level, like here:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
        'https://example.com/materias/?novedades=LC&p',
    ]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(url, callback=self.parse_following_urls, dont_filter=True)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        # Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
                'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }
        # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)
But there is another problem, I think related to the urls, and now I am getting this traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/nikita/scrapy/bookstore_2/bookstore_2/spiders/bookstore_2.py", line 16, in parse
    yield Request(url, callback=self.parse_following_urls, dont_filter=True)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /book/?id=9780374281083
Maybe it's because I have to tell Scrapy what the base url is? Should I add a urljoin somewhere?
EDIT_2 ---
Ok, the problem was with the urls: the extracted hrefs are relative paths without a scheme. Wrapping each extracted url in

response.urljoin()

before yielding the Request solved the issue.
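(For reference, `response.urljoin(path)` resolves a relative path against the url of the response it came from, with the same semantics as the standard library's `urljoin`. A sketch with a made-up base url standing in for `response.url`:)

```python
from urllib.parse import urljoin

# response.urljoin(path) behaves like urljoin(response.url, path):
# a scheme-less path is resolved against the page the response came from.
base = "https://example.com/materias/?novedades=LC&p"  # hypothetical response.url
relative = "/book/?id=9780374281083"                   # href extracted by the spider

full = urljoin(base, relative)
print(full)  # https://example.com/book/?id=9780374281083
```

With the absolute url, `Request(full, ...)` no longer raises "Missing scheme in request url".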