Crawl and scrape a complete site with scrapy

import scrapy
from scrapy import Request

# scrapy crawl jobs9 -o jobs9.csv -t csv
class JobsSpider(scrapy.Spider):
    name = "jobs9"
    allowed_domains = ["vapedonia.com"]
    start_urls = [
        "https://www.vapedonia.com/7-principiantes-kit-s-de-inicio-",
        "https://www.vapedonia.com/10-cigarrillos-electronicos-",
        "https://www.vapedonia.com/11-mods-potencia-",
        "https://www.vapedonia.com/12-consumibles",
        "https://www.vapedonia.com/13-baterias",
        "https://www.vapedonia.com/23-e-liquidos",
        "https://www.vapedonia.com/26-accesorios",
        "https://www.vapedonia.com/31-atomizadores-reparables",
        "https://www.vapedonia.com/175-alquimia-",
        "https://www.vapedonia.com/284-articulos-en-liquidacion",
    ]

    def parse(self, response):
        products = response.xpath('//div[@class="product-container clearfix"]')
        for product in products:
            image = product.xpath('div[@class="center_block"]/a/img/@src').extract_first()
            link = product.xpath('div[@class="center_block"]/a/@href').extract_first()
            name = product.xpath('div[@class="right_block"]/p/a/text()').extract_first()
            price = product.xpath('div[@class="right_block"]/div[@class="content_price"]/span[@class="price"]/text()').extract_first().encode("utf-8")
            yield {'Image': image, 'Link': link, 'Name': name, 'Price': price}

        relative_next_url = response.xpath('//*[@id="pagination_next"]/a/@href').extract_first()
        absolute_next_url = "https://www.vapedonia.com" + str(relative_next_url)
        yield Request(absolute_next_url, callback=self.parse)

With that code, I correctly scrape the products of a page and its subpages; all pages get crawled.

If I want to scrape the whole site, though, I must put the category URLs manually into "start_urls". The good thing would be to crawl those category URLs themselves, so the crawl becomes dynamic.
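Something like the following is what I imagine (an untested sketch; the '//a/@href' selector and the numeric category-URL pattern are assumptions on my part):

import re
import scrapy

class CategoriesSpider(scrapy.Spider):
    # Sketch: parse() crawls the category links found on the home page,
    # parse_category() scrapes the products on each category page.
    name = "categories_sketch"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com/"]

    # Assumption: category URLs carry a numeric id right after the domain,
    # like the ones hard-coded in start_urls above.
    category_re = re.compile(r"https://www\.vapedonia\.com/\d+")

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            url = response.urljoin(href)
            if self.category_re.match(url):
                yield scrapy.Request(url, callback=self.parse_category)

    def parse_category(self, response):
        # Same product extraction as in the spider above, shortened here.
        for product in response.xpath('//div[@class="product-container clearfix"]'):
            yield {
                'Link': product.xpath('div[@class="center_block"]/a/@href').extract_first(),
                'Name': product.xpath('div[@class="right_block"]/p/a/text()').extract_first(),
            }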

How can I mix crawling with scraping beyond the simple paginated crawl?

Thank you.

Now I have improved my code; here is the new version:

import scrapy
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# scrapy crawl jobs10 -o jobs10.csv -t csv
class JobsSpider(scrapy.spiders.CrawlSpider):
    name = "jobs10"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com/"]

    rules = (
        Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)),
             callback='parse_category'),
    )

    def parse_category(self, response):
        products = response.xpath('//div[@class="product-container clearfix"]')
        for product in products:
            image = product.xpath('div[@class="center_block"]/a/img/@src').extract_first()
            link = product.xpath('div[@class="center_block"]/a/@href').extract_first()
            name = product.xpath('div[@class="right_block"]/p/a/text()').extract_first()
            price = product.xpath('div[@class="right_block"]/div[@class="content_price"]/span[@class="price"]/text()').extract_first().encode("utf-8")
            yield {'Image': image, 'Link': link, 'Name': name, 'Price': price}

The changes I've made are the following:

1- I import CrawlSpider, Rule and LinkExtractor:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

2- the JobsSpider class no longer inherits from "scrapy.Spider". It now inherits from scrapy.spiders.CrawlSpider (imported in the previous step)

3- "starts_urls" is not composed from a static list of urls anymore, we just take the domain name, so

start_urls = ["https://www.vapedonia.com/7-principiantes-kit-s-de-inicio-", "https://www.vapedonia.com/10-cigarrillos-electronicos-", "https://www.vapedonia.com/11-mods-potencia-", "https://www.vapedonia.com/12-consumibles", "https://www.vapedonia.com/13-baterias", "https://www.vapedonia.com/23-e-liquidos", "https://www.vapedonia.com/26-accesorios", "https://www.vapedonia.com/31-atomizadores-reparables", "https://www.vapedonia.com/175-alquimia-", "https://www.vapedonia.com/284-articulos-en-liquidacion"]

is replaced by

start_urls = ["https://www.vapedonia.com/"]

4- we add the rules:

rules = (Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)), callback='parse_category'), )

The callback is no longer "parse" but "parse_category" (a quick standalone check of what this rule extracts is sketched after this list).

5- the previous pagination crawling disappears, so the following code is simply removed:

relative_next_url = response.xpath('//*[@id="pagination_next"]/a/@href').extract_first()
absolute_next_url = "https://www.vapedonia.com" + str(relative_next_url)
yield Request(absolute_next_url, callback=self.parse)
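As a side note, the rule's LinkExtractor can be sanity-checked on its own, outside a crawl (a small standalone sketch against made-up HTML, not the real site):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Made-up page with one category-style link and one non-matching link.
html = (b'<html><body>'
        b'<a href="https://www.vapedonia.com/23-e-liquidos">liquids</a>'
        b'<a href="https://www.vapedonia.com/content/about">about</a>'
        b'</body></html>')
response = HtmlResponse(url="https://www.vapedonia.com/", body=html, encoding="utf-8")
extractor = LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",))
print([link.url for link in extractor.extract_links(response)])
# Only https://www.vapedonia.com/23-e-liquidos matches the \d+ pattern.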

As I see it, and it seems very logical, the pagination-crawling process is replaced by a URL-crawling process.

But... it does not work, and even the "price" field, which worked with encode("utf-8"), does not work anymore.

Answer

You need to use a CrawlSpider with rules in this case. Below is a simple translation of your scraper:

class JobsSpider(scrapy.spiders.CrawlSpider):
    name = "jobs9"
    allowed_domains = ["vapedonia.com"]
    start_urls = ["https://www.vapedonia.com"]

    rules = (
        Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)),
             callback='parse_category'),
    )

    def parse_category(self, response):
        products = response.xpath('//div[@class="product-container clearfix"]')
        for product in products:
            image = product.xpath('div[@class="center_block"]/a/img/@src').extract_first()
            link = product.xpath('div[@class="center_block"]/a/@href').extract_first()
            name = product.xpath('div[@class="right_block"]/p/a/text()').extract_first()
            price = product.xpath('div[@class="right_block"]/div[@class="content_price"]/span[@class="price"]/text()').extract_first().encode("utf-8")
            yield {'Image': image, 'Link': link, 'Name': name, 'Price': price}
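One caveat with this translation: when a Rule has a callback, its follow argument defaults to False, so matching links are only extracted from the start page and the pagination links inside the categories are never followed. If the whole site should be covered, follow=True is probably what you want (a sketch):

rules = (
    # With a callback set, Rule defaults to follow=False; follow=True keeps
    # extracting matching links (pagination, subcategories) from the pages
    # this rule already matched.
    Rule(LinkExtractor(allow=(r"https://www.vapedonia.com/\d+.*",)),
         callback='parse_category', follow=True),
)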

Look at different spiders on https://doc.scrapy.org/en/latest/topics/spiders.html
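As for the "price" crash: extract_first() returns None when the XPath matches nothing, and None.encode("utf-8") raises AttributeError; with the rule now matching more pages than before, some of them likely contain products without a visible price. A defensive sketch (and note that encode("utf-8") is only needed on Python 2, where the CSV exporter chokes on non-ASCII unicode strings):

price = product.xpath(
    'div[@class="right_block"]/div[@class="content_price"]'
    '/span[@class="price"]/text()').extract_first(default='')
# extract_first(default='') never returns None, so the line below
# cannot blow up on products that have no price element.
yield {'Image': image, 'Link': link, 'Name': name, 'Price': price.strip()}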
