Crawl and scrape a complete site with scrapy

import scrapy
from scrapy import Request

# scrapy crawl jobs9 -o jobs9.csv -t csv
class JobsSpider(scrapy.Spider):
    name = "jobs9"
    allowed_domains = ["vapedonia.com"]
    start_urls = [
        "https://www.vapedonia.com/7-principiantes-kit-s-de-inicio-",
        "https://www.vapedonia.com/10-cigarrillos-electronicos-",
        "https://www.vapedonia.com/11-mods-potencia-",
        "https://www.vapedonia.com/12-consumibles",
        "https://www.vapedonia.com/13-baterias",
        "https://www.vapedonia.com/23-e-liquidos",
        "https://www.vapedonia.com/26-accesorios",
        "https://www.vapedonia.com/31-atomizadores-reparables",
        "https://www.vapedonia.com/175-alquimia-",
        "https://www.vapedonia.com/284-articulos-en-liquidacion",
    ]

    def parse(self, response):
        products = response.xpath('//div[@class="product-container clearfix"]')
        for product in products:
            image = product.xpath('div[@class="center_block"]/a/img/@src').extract_first()
            link = product.xpath('div[@class="center_block"]/a/@href').extract_first()
            name = product.xpath('div[@class="right_block"]/p/a/text()').extract_first()
            price = product.xpath('div[@class="right_block"]/div[@class="content_price"]/span[@class="price"]/text()').extract_first().encode("utf-8")
            yield {'Image': image, 'Link': link, 'Name': name, 'Price': price}

        relative_next_url = response.xpath('//*[@id="pagination_next"]/a/@href').extract_first()
        absolute_next_url = "https://www.vapedonia.com" + str(relative_next_url)
        yield Request(absolute_next_url, callback=self.parse)

With that code, I correctly scrape the products of a page and its subpages; all pages get crawled.

If I want to scrape the whole site, though, I must put the category URLs manually into "start_urls". The good thing would be to crawl those category URLs themselves, so the crawl becomes dynamic.
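Something like the following is what I imagine (an untested sketch; the '//a/@href' selector and the numeric category-URL pattern are assumptions on my part):

import re
import scrapy

class CategoriesSpider(scrapy.Spider):
    # Sketch: parse() crawls the category links found on the home page,
    # parse_category() scrapes the products on each category page.
    name = "categories_sketch"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com/"]

    # Assumption: category URLs carry a numeric id right after the domain,
    # like the ones hard-coded in start_urls above.
    category_re = re.compile(r"https://www\.vapedonia\.com/\d+")

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            url = response.urljoin(href)
            if self.category_re.match(url):
                yield scrapy.Request(url, callback=self.parse_category)

    def parse_category(self, response):
        # Same product extraction as in the spider above, shortened here.
        for product in response.xpath('//div[@class="product-container clearfix"]'):
            yield {
                'Link': product.xpath('div[@class="center_block"]/a/@href').extract_first(),
                'Name': product.xpath('div[@class="right_block"]/p/a/text()').extract_first(),
            }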

How can I mix crawling with scraping beyond the simple paginated crawl?

Thank you.

Now I have improved my code; here is the new version:

import scrapy
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# scrapy crawl jobs10 -o jobs10.csv -t csv
class JobsSpider(scrapy.spiders.CrawlSpider):
    name = "jobs10"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com/"]

    rules = (
        Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)),
             callback='parse_category'),
    )

    def parse_category(self, response):
        products = response.xpath('//div[@class="product-container clearfix"]')
        for product in products:
            image = product.xpath('div[@class="center_block"]/a/img/@src').extract_first()
            link = product.xpath('div[@class="center_block"]/a/@href').extract_first()
            name = product.xpath('div[@class="right_block"]/p/a/text()').extract_first()
            price = product.xpath('div[@class="right_block"]/div[@class="content_price"]/span[@class="price"]/text()').extract_first().encode("utf-8")
            yield {'Image': image, 'Link': link, 'Name': name, 'Price': price}

The changes I've made are the following:

1- I import CrawlSpider, Rule and LinkExtractor:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

2- the JobsSpider class no longer inherits from "scrapy.Spider". It now inherits from scrapy.spiders.CrawlSpider (imported in the previous step)

3- "starts_urls" is not composed from a static list of urls anymore, we just take the domain name, so

start_urls = ["https://www.vapedonia.com/7-principiantes-kit-s-de-inicio-", "https://www.vapedonia.com/10-cigarrillos-electronicos-", "https://www.vapedonia.com/11-mods-potencia-", "https://www.vapedonia.com/12-consumibles", "https://www.vapedonia.com/13-baterias", "https://www.vapedonia.com/23-e-liquidos", "https://www.vapedonia.com/26-accesorios", "https://www.vapedonia.com/31-atomizadores-reparables", "https://www.vapedonia.com/175-alquimia-", "https://www.vapedonia.com/284-articulos-en-liquidacion"]

is replaced by

start_urls = ["https://www.vapedonia.com/"]

4- we add the rules:

rules = (Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)), callback='parse_category'), )

The callback is no longer "parse" but "parse_category" (a quick standalone check of what this rule extracts is sketched after this list).

5- the previous pagination crawling disappears, so the following code is simply removed:

relative_next_url = response.xpath('//*[@id="pagination_next"]/a/@href').extract_first()
absolute_next_url = "https://www.vapedonia.com" + str(relative_next_url)
yield Request(absolute_next_url, callback=self.parse)
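As a side note, the rule's LinkExtractor can be sanity-checked on its own, outside a crawl (a small standalone sketch against made-up HTML, not the real site):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Made-up page with one category-style link and one non-matching link.
html = (b'<html><body>'
        b'<a href="https://www.vapedonia.com/23-e-liquidos">liquids</a>'
        b'<a href="https://www.vapedonia.com/content/about">about</a>'
        b'</body></html>')
response = HtmlResponse(url="https://www.vapedonia.com/", body=html, encoding="utf-8")
extractor = LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",))
print([link.url for link in extractor.extract_links(response)])
# Only https://www.vapedonia.com/23-e-liquidos matches the \d+ pattern.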

As I see it, and it seems very logical, the pagination-crawling process is replaced by a URL-crawling process.

But... it does not work, and even the "price" field, which worked with encode("utf-8"), does not work anymore.

Answer

You need to use a CrawlSpider with rules in this case. Below is a simple translation of your scraper:

class JobsSpider(scrapy.spiders.CrawlSpider):
    name = "jobs9"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com"]

    rules = (
        Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)),
             callback='parse_category'),
    )

    def parse_category(self, response):
        products = response.xpath('//div[@class="product-container clearfix"]')
        for product in products:
            image = product.xpath('div[@class="center_block"]/a/img/@src').extract_first()
            link = product.xpath('div[@class="center_block"]/a/@href').extract_first()
            name = product.xpath('div[@class="right_block"]/p/a/text()').extract_first()
            price = product.xpath('div[@class="right_block"]/div[@class="content_price"]/span[@class="price"]/text()').extract_first().encode("utf-8")
            yield {'Image': image, 'Link': link, 'Name': name, 'Price': price}
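One caveat with this translation: when a Rule has a callback, its follow argument defaults to False, so matching links are only extracted from the start page and the pagination links inside the categories are never followed. If the whole site should be covered, follow=True is probably what you want (a sketch):

rules = (
    # With a callback set, Rule defaults to follow=False; follow=True keeps
    # extracting matching links (pagination, subcategories) from the pages
    # this rule already matched.
    Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)),
         callback='parse_category', follow=True),
)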

Look at different spiders on https://doc.scrapy.org/en/latest/topics/spiders.html
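As for the "price" crash: extract_first() returns None when the XPath matches nothing, and None.encode("utf-8") raises AttributeError; with the rule now matching more pages than before, some of them likely contain products without a visible price. A defensive sketch (and note that encode("utf-8") is only needed on Python 2, where the CSV exporter chokes on non-ASCII unicode strings):

price = product.xpath(
    'div[@class="right_block"]/div[@class="content_price"]'
    '/span[@class="price"]/text()').extract_first(default='')
# extract_first(default='') never returns None, so the line below
# cannot blow up on products that have no price element.
yield {'Image': image, 'Link': link, 'Name': name, 'Price': price.strip()}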
