Running dozens of Scrapy spiders in a controlled manner

2024/9/8 10:22:24

I'm trying to build a system to run a few dozen Scrapy spiders, save the results to S3, and let me know when it finishes. There are several similar questions on StackOverflow (e.g. this one and this other one), but they all seem to use the same recommendation (from the Scrapy docs): set up a CrawlerProcess, add the spiders to it, and hit start().

When I tried this method with all 325 of my spiders, though, it eventually locks up and fails because it attempts to open too many file descriptors on the system that runs it. I've tried a few things that haven't worked.

What is the recommended way to run a large number of spiders with Scrapy?

Edited to add: I understand I can scale up to multiple machines and pay for services to help coordinate (e.g. ScrapingHub), but I'd prefer to run this on one machine using some sort of process pool + queue so that only a small fixed number of spiders are ever running at the same time.

Answer

The simplest way to do this is to run them all from the command line. For example:

$ scrapy list | xargs -P 4 -n 1 scrapy crawl

Will run all your spiders, with up to 4 running in parallel at any time. You can then send a notification in a script once this command has completed.

A more robust option is to use scrapyd. This comes with an API, a minimal web interface, etc. It will also queue the crawls and only run a certain (configurable) number at once. You can interact with it via the API to start your spiders and send notifications once they are all complete.

Scrapy Cloud is a perfect fit for this [disclaimer: I work for Scrapinghub]. It will allow you only to run a certain number at once and has a queue of pending jobs (which you can modify, browse online, prioritize, etc.) and a more complete API than scrapyd.

You shouldn't run all your spiders in a single process. It will probably be slower, can introduce unforeseen bugs, and you may hit resource limits (like you did). If you run them separately using any of the options above, just run enough to max out your hardware resources (usually CPU/network). If you still get problems with file descriptors at that point you should increase the limit.

https://en.xdnf.cn/q/73034.html

Related Q&A

How to merge two DataFrame columns and apply pandas.to_datetime to it?

Im learning to use pandas, to use it for some data analysis. The data is supplied as a csv file, with several columns, of which i only need to use 4 (date, time, o, c). Ill like to create a new DataFr…

Breaking a parent function from within a child function (PHP Preferrably)

I was challenged how to break or end execution of a parent function without modifying the code of the parent, using PHPI cannot figure out any solution, other than die(); in the child, which would end …

Get absolute path of caller file

Say I have two files in different directories: 1.py (say, in C:/FIRST_FOLDER/1.py) and 2.py (say, in C:/SECOND_FOLDER/2.py).The file 1.py imports 2.py (using sys.path.insert(0, #path_of_2.py) followed,…

Pandas dataframe to excel gives file is not UTF-8 encoded

Im working on lists that I want to export into an Excel file.I found a lot of people advising to use pandas.dataframe so thats what I did. I could create the dataframe but when I try to export it to Ex…

Finding complex roots from set of non-linear equations in python

I have been testing an algorithm that has been published in literature that involves solving a set of m non-linear equations in both Matlab and Python. The set of non-linear equations involves input v…

Python on Raspberry Pi user input inside infinite loop misses inputs when hit with many

I have a very basic parrot script written in Python that simply prompts for a user input and prints it back inside an infinite loop. The Raspberry Pi has a USB barcode scanner attached for the input.wh…

How to append two bytes in python?

Say you have b\x04 and b\x00 how can you combine them as b\x0400?

Pythonic way to write a function which modifies a list?

In python function arguments are passed by object reference. This means the simplest code to modify a list will modify the object itself.a = [1,2,3]def remove_one(b):b.remove(1)remove_one(a) print(a)T…

Trying different functions until one does not throw an exception

I have some functions which try various methods to solve a problem based on a set of input data. If the problem cannot be solved by that method then the function will throw an exception.I need to try t…

Python: Extracting lists from list with module or regular expression

Im trying to extract lists/sublists from one bigger integer-list with Python2.7 by using start- and end-patterns. I would like to do it with a function, but I cant find a library, algorithm or a regula…