I'm using Scrapy to download pages from many different domains in parallel. I have hundreds of thousands of pages to download, so performance is important.
Unfortunately, when I profile Scrapy's throughput, I'm only getting about 2 pages per second on average. I've previously written my own multithreaded spiders that handle hundreds of pages per second, so I expected Scrapy's use of Twisted to be capable of similar speed.
How do I speed Scrapy up? I really like the framework, but this performance issue could be a deal-breaker for me.
Here's the relevant part of the settings.py file. Is there some important setting I've missed?
LOG_ENABLED = False              # logging off, to avoid its overhead
CONCURRENT_REQUESTS = 100        # global cap on concurrent requests
CONCURRENT_REQUESTS_PER_IP = 8   # per-IP cap; non-zero, so it overrides CONCURRENT_REQUESTS_PER_DOMAIN
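For completeness, here are the related settings I haven't overridden, shown with what I believe are the Scrapy 0.14 defaults:

```
# Settings left untouched -- values shown are (as far as I know) the 0.14 defaults
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # ignored while CONCURRENT_REQUESTS_PER_IP is non-zero
DOWNLOAD_DELAY = 0                   # no artificial delay between requests
```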
A few details about my setup:
- I'm using Scrapy version 0.14.
- The project is deployed on an EC2 large instance, so there should be plenty of memory, CPU, and bandwidth to play with.
- I'm scheduling crawls through the JSON API, keeping the crawler topped up with a few dozen concurrent crawls at any given time (the first sketch after this list shows roughly what that looks like).
- As I said at the beginning, I'm downloading pages from many different sites, so remote server performance and the CONCURRENT_REQUESTS_PER_IP cap shouldn't be the bottleneck.
- For the moment, I'm doing very little post-processing: no XPath, no regex; I'm just saving the URL and a few basic statistics for each page (see the second sketch below). This will change later once I get the basic performance kinks worked out.
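Here's roughly how the scheduling works. This is a minimal sketch, not my exact code: I'm assuming scrapyd's `schedule.json` endpoint on its default port, and the project name, spider name, and `start_url` spider argument are all placeholders.

```
# Minimal sketch of the scheduling loop (Python 2, to match Scrapy 0.14).
import urllib
import urllib2

def schedule_crawl(project, spider, **spider_args):
    # scrapyd treats any extra POST parameters as spider arguments
    params = dict(project=project, spider=spider, **spider_args)
    data = urllib.urlencode(params)
    return urllib2.urlopen('http://localhost:6800/schedule.json', data).read()

# Keep a few dozen crawls in flight at any given time.
for domain in ['example.com', 'example.org']:  # really: hundreds of thousands
    schedule_crawl('myproject', 'myspider', start_url='http://%s/' % domain)
```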
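And the per-page work is about this minimal (again a sketch; the item and field names are invented, but this is the level of processing being done per response):

```
# Sketch of the per-page processing: no XPath, no regex, just the URL
# and a few basic stats for each page.
from scrapy.spider import BaseSpider
from scrapy.item import Item, Field

class PageStats(Item):
    url = Field()
    status = Field()
    size = Field()

class StatsSpider(BaseSpider):
    name = 'stats'
    start_urls = ['http://example.com/']  # placeholder; really set per crawl

    def parse(self, response):
        item = PageStats()
        item['url'] = response.url
        item['status'] = response.status
        item['size'] = len(response.body)
        return item
```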