I'm working with Scrapy 1.1 and I have a project where spider '1' scrapes site A (where I acquire 90% of the information to fill my items). However, depending on the results of the site A scrape, I may need to scrape additional information from site B. As far as developing the program goes, does it make more sense to scrape site B within spider '1', or would it be possible to interact with site B from within a pipeline object? I prefer the latter, thinking that it decouples the scraping of the two sites, but I'm not sure if this is possible or the best way to handle this use case. Another approach might be to use a second spider (spider '2') for site B, but then I assume I would have to let spider '1' run, save to the db, then run spider '2'. Anyway, any advice would be appreciated.
Both approaches are very common and this is just a question of preference. For your case, containing everything in one spider sounds like a straightforward solution.
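For reference, a minimal single-spider sketch of that approach (the URLs, field names and selectors below are made up): the site A callback chains a Request to site B and passes the partially filled item along in meta:

    import scrapy


    class SiteASpider(scrapy.Spider):
        name = 'site_a'
        start_urls = ['http://site-a.example/']  # placeholder

        def parse(self, response):
            # fill ~90% of the item from site A (selectors are placeholders)
            item = {'title': response.css('h1::text').extract_first()}
            # hypothetical: site A sometimes links to extra data on site B
            extra_url = response.css('a.more-info::attr(href)').extract_first()
            if extra_url:
                # chain a request to site B, carrying the partial item along
                yield scrapy.Request(extra_url,
                                     callback=self.parse_site_b,
                                     meta={'item': item})
            else:
                yield item

        def parse_site_b(self, response):
            # finish the item with data from site B and emit it
            item = response.meta['item']
            item['some_extra_stuff'] = response.css('p.extra::text').extract_first()
            yield item

The downside is that the site B logic lives inside the same spider, which is exactly what the pipeline approach below avoids.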
You can add a url field (extra_url in the code below) to your item and schedule and parse it later in the pipeline:
    from scrapy import Request
    from scrapy.exceptions import DropItem


    class MyPipeline(object):
        def __init__(self, crawler):
            self.crawler = crawler

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def process_item(self, item, spider):
            extra_url = item.get('extra_url', None)
            if not extra_url:
                return item
            req = Request(
                url=extra_url,
                callback=self.custom_callback,
                meta={'item': item},
            )
            self.crawler.engine.crawl(req, spider)
            # you have to drop the item here since you will return it later anyway
            raise DropItem()

        def custom_callback(self, response):
            # retrieve your item
            item = response.meta['item']
            # do something to add to item
            item['some_extra_stuff'] = ...
            del item['extra_url']
            yield item
What the above code does is check whether the item has an extra_url field; if it does, it drops the item and schedules a new request for that url. That request fills the item with some extra data and sends it back through the pipeline.
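To wire this up, the spider only needs to set extra_url on the item when site B actually has to be visited, and the pipeline has to be enabled in the project settings. A rough sketch, with a made-up module path, field names and selectors:

    # settings.py -- enable the pipeline (module path and priority are illustrative)
    ITEM_PIPELINES = {
        'myproject.pipelines.MyPipeline': 300,
    }

    # in the spider's site A callback
    def parse(self, response):
        # fill most of the item from site A (selectors are placeholders)
        item = {'title': response.css('h1::text').extract_first()}
        # only set extra_url when site B needs to be scraped as well
        extra_url = response.css('a.more-info::attr(href)').extract_first()
        if extra_url:
            item['extra_url'] = extra_url
        yield item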