I am building a Scrapy project with multiple spiders (one spider per domain). The URLs to be scraped come in dynamically from user-given queries, so I do not need broad crawls or even link following: URLs arrive one after the other and I just need to extract data with selectors. My idea was that if I could push the URLs onto a message queue that the Scrapy spiders consume from, I'd be fine, but I am not able to figure out how. I have checked
https://github.com/darkrho/scrapy-redis
but I feel it is not suitable for my purposes, since I need multiple queues (one queue per spider). From what I have read, one way seems to be to override the start_requests method in the spider, but here again I am not clear on what to do (I am new to Python and Scrapy). Can I just treat the spider like any normal Python script and override that method to read from a (any) message queue? Also, I need the spider(s) running 24/7, scraping whenever there is a request on the queue. I gather I should use signals and raise the DontCloseSpider exception somewhere, but where do I do that? I am pretty lost. Please help.
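To make it concrete, here is a minimal sketch of what I have in mind after skimming the scrapy-redis code, assuming Redis as the message queue; the queue key, connection details, and selectors are placeholders, and I am not sure this is the right way to wire up the signal:

```python
import redis
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class AbcSpider(scrapy.Spider):
    name = "abc"
    queue_key = "abc_spider:urls"  # placeholder: one queue per spider

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(AbcSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider.server = redis.Redis(host="localhost", port=6379)
        # re-check the queue whenever the spider goes idle instead of letting it close
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        # schedule whatever is already sitting on the queue at startup
        return self.next_requests()

    def next_requests(self):
        # drain the queue and turn each url into a request
        while True:
            url = self.server.lpop(self.queue_key)
            if url is None:
                break
            yield scrapy.Request(url.decode("utf-8"), callback=self.parse)

    def spider_idle(self):
        # pull any newly arrived urls, then refuse to close so the spider runs 24/7
        for req in self.next_requests():
            self.crawler.engine.crawl(req, spider=self)  # newer Scrapy versions drop the spider argument
        raise DontCloseSpider

    def parse(self, response):
        # placeholder selectors; each spider would have its own
        yield {"title": response.css("title::text").get()}
```

Is something along these lines the intended approach, or is there a cleaner way?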
Here's the scenario I am looking at:
User -> Query -> url from abc.com -> abc-spider
              -> url from xyz.com -> xyz-spider
              -> url from ghi.com -> ghi-spider
Each URL has the same thing to be scraped, regardless of the website, so I have selectors doing that in each spider. That is just the single-user scenario; when there are multiple users, there will be multiple unrelated URLs coming in for the same spider, so it will look something like this:
query1, query2, query3
abc.com -> url_abc1, url_abc2, url_abc3
xyz.com -> url_xyz1, url_xyz2, url_xyz3
ghi.com -> url_ghi1, url_ghi2, url_ghi3
So for each website, these URLs will arrive dynamically and be pushed onto that website's own message queue. Each spider must then consume its own queue and give me the scraped items whenever a request shows up on that queue.
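For the producer side, this is roughly what I am picturing: a hypothetical helper (the queue names and domain mapping are made up to match the sketch above) that takes the URLs generated for a user query and routes each one onto the queue of the matching spider:

```python
import redis
from urllib.parse import urlparse

# hypothetical mapping from domain to that spider's queue
QUEUE_BY_DOMAIN = {
    "abc.com": "abc_spider:urls",
    "xyz.com": "xyz_spider:urls",
    "ghi.com": "ghi_spider:urls",
}

server = redis.Redis(host="localhost", port=6379)

def push_urls(urls):
    """Route each url from a user query onto the queue of its spider."""
    for url in urls:
        domain = urlparse(url).netloc.replace("www.", "")
        queue = QUEUE_BY_DOMAIN.get(domain)
        if queue is not None:
            server.rpush(queue, url)

# e.g. for query1:
# push_urls(["http://abc.com/item1", "http://xyz.com/item1", "http://ghi.com/item1"])
```

Does this overall setup (one queue per spider, spiders kept alive via DontCloseSpider) make sense, and if so, how do I fill in the parts I am unsure about?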