Scrapy: Program organization when interacting with a secondary website

2024/10/15 1:25:22

I'm working with Scrapy 1.1 and I have a project where spider '1' scrapes site A (where I acquire 90% of the information to fill my items). However, depending on the results of the site A scrape, I may need to scrape additional information from site B. In terms of program design, does it make more sense to scrape site B within spider '1', or would it be possible to interact with site B from within a pipeline object? I prefer the latter, thinking that it decouples the scraping of the two sites, but I'm not sure if this is possible or the best way to handle this use case. Another approach might be to use a second spider (spider '2') for site B, but then I assume I would have to let spider '1' run, save to the database, and then run spider '2'. Any advice would be appreciated.

Answer

Both approaches are very common, and this is just a question of preference. For your case, keeping everything in one spider sounds like a straightforward solution.

If you prefer the pipeline approach, you can add a URL field to your item, then schedule and parse that URL later from the pipeline:

    from scrapy import Request
    from scrapy.exceptions import DropItem


    class MyPipeline(object):

        def __init__(self, crawler):
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy calls this to build the pipeline and hands over the crawler
            return cls(crawler)

        def process_item(self, item, spider):
            extra_url = item.get('extra_url', None)
            if not extra_url:
                return item
            req = Request(
                url=extra_url,
                callback=self.custom_callback,
                meta={'item': item},
            )
            self.crawler.engine.crawl(req, spider)
            # you have to drop the item here since you will return it later anyway
            raise DropItem()

        def custom_callback(self, response):
            # retrieve your item
            item = response.meta['item']
            # do something to add to item
            item['some_extra_stuff'] = ...
            # remove the URL so the item is not scheduled again
            del item['extra_url']
            yield item

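The code above checks whether the item has an extra_url field. If it does, the pipeline drops the item and schedules a new request for that URL. The callback for that request fills in the extra data, removes extra_url so the item is not scheduled a second time, and yields the item, which sends it back through the pipeline.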
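A pipeline only runs if it is enabled in the project's settings through the ITEM_PIPELINES setting. A minimal sketch, assuming the project package is called myproject and the class lives in its pipelines.py (both names are hypothetical):

```python
# settings.py
# "myproject.pipelines" is an assumed module path; adjust it to your project.
# The number sets the order in which pipelines run (lower values run first).
ITEM_PIPELINES = {
    "myproject.pipelines.MyPipeline": 300,
}
```

If the project has other pipelines that store items, it probably makes sense to give this one a lower number than theirs, so that incomplete items are dropped and rescheduled before anything is saved.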

