How can I make start_urls in Scrapy consume from a message queue?

2024/9/8 8:53:23

I am building a Scrapy project in which I have multiple spiders (a spider for each domain). The URLs to be scraped come dynamically from a user-given query, so I do not need to do broad crawls or even follow links. URLs will arrive one after the other and I just need to extract data using selectors. So I was thinking that if I could pass the URLs onto a message queue that the Scrapy spider could consume from, I'd be fine. But I am not able to figure it out. I have checked

https://github.com/darkrho/scrapy-redis

but I feel it is not suitable for my purposes, as I need multiple queues (a single queue for each spider). From what I have learned, one way seems to be to override the start_requests method in the spider. But here again I am not clear on what to do (I'm new to Python and Scrapy). Could I just treat it as any normal Python script and override the method to use a (any) message queue? Also, I need the spider(s) running 24/7 and scraping whenever there is a request on the queue. I figured I should use signals and raise the DontCloseSpider exception somewhere, but where do I do that? I am pretty lost. Please help.

Here's the scenario I am looking at:

User -> Query -> url from abc.com -> abc-spider
              -> url from xyz.com -> xyz-spider
              -> url from ghi.com -> ghi-spider

Each URL has the same thing to be scraped on every website, so I have selectors doing that in each spider. The above is just a single-user scenario; when there are multiple users, there will be multiple unrelated URLs coming in for the same spider, so it will be something like this:

query1, query2, query3

abc.com -> url_abc1, url_abc2, url_abc3

xyz.com -> url_xyz1, url_xyz2, url_xyz3

ghi.com -> url_ghi1, url_ghi2, url_ghi3

So for each website, these URLs will arrive dynamically and be pushed onto their respective message queues. Each spider meant for a website must consume its own queue and give me the scraped items whenever there is a request on that queue.
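
Roughly what I have in mind for the push side is something like this (a hypothetical sketch only; I am imagining one Redis list per domain, and the key names and client setup are just placeholders, not working code):

import redis

# One list per spider/domain; a user query fans out into per-domain URL pushes.
r = redis.Redis(host='localhost', port=6379, db=0)

def dispatch_query(urls_by_domain):
    # urls_by_domain looks like {'abc.com': ['url_abc1', ...], 'xyz.com': [...], ...}
    for domain, urls in urls_by_domain.items():
        for url in urls:
            r.rpush('queue:' + domain, url)  # abc-spider would read 'queue:abc.com', etc.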

Answer

This is a very common and (IMO) excellent way to use Scrapy as part of a data pipeline; I do it all the time.

You are correct that you want to override the spider's start_requests() method. I don't know how Scrapy behaves if you have start_requests() defined as well as a start_urls attribute, but I'd recommend just using start_requests() if you're consuming from a dynamic source like a database.

Here is some example code; it's untested but should give you the right idea. It assumes self.queue is a queue client populated by another process. Please let me know if you need more information.

import scrapy

class ProfileSpider(scrapy.Spider):
    name = 'myspider'

    # self.queue is assumed to be a client for your message queue
    # (set up elsewhere, e.g. in __init__) that another process populates.

    def start_requests(self):
        # Pull URLs off the queue forever, turning each one into a Request.
        for url in self._pop_queue():
            yield self.make_requests_from_url(url)

    def _pop_queue(self):
        # Expose the queue as a generator of URLs.
        while True:
            yield self.queue.read()

This exposes your queue as a generator. If you want to minimize the amount of empty looping (because the queue could be empty a lot of the time), you can add a sleep command or exponential backoff in the _pop_queue loop. (If the queue is empty, sleep for a few seconds and try to pop again.)
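
For example, here is a minimal sketch of that backoff (it assumes your queue client's read() returns None when nothing is available; adjust to whatever your queue library actually does):

import time  # at the top of the spider module

class ProfileSpider(scrapy.Spider):
    # ... name and start_requests as above ...

    def _pop_queue(self):
        delay = 1
        while True:
            url = self.queue.read()
            if url:
                delay = 1          # got work, reset the backoff
                yield url
            else:
                time.sleep(delay)  # queue empty: wait, then back off (capped at 60s)
                delay = min(delay * 2, 60)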

Assuming no fatal errors happen in your code, I believe this shouldn't terminate because of the construction of the loops / generators.
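
If you'd rather not rely on that, the signals approach mentioned in the question also works: connect a handler to the spider_idle signal and raise DontCloseSpider from it, scheduling any newly queued URLs there. A minimal sketch (again assuming a self.queue client whose read() returns None when empty; note that engine.crawl takes a spider argument on older Scrapy versions and just the request on newer ones):

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class ProfileSpider(scrapy.Spider):
    name = 'myspider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Fire our handler whenever the spider runs out of pending requests.
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def spider_idle(self):
        url = self.queue.read()  # assumed queue client, as above
        if url:
            # Schedule the new request and keep the spider alive.
            self.crawler.engine.crawl(scrapy.Request(url, callback=self.parse), self)
        raise DontCloseSpider

    def parse(self, response):
        # Per-domain selector logic goes here.
        ...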
