How do I improve Scrapy's download speed?

2024/10/9 0:52:07

I'm using scrapy to download pages from many different domains in parallel. I have hundreds of thousands of pages to download, so performance is important.

Unfortunately, when I profile Scrapy's speed, I'm only getting about 2 pages per second on average. I've previously written my own multithreaded spiders that do hundreds of pages per second -- I thought for sure Scrapy's use of Twisted, etc. would be capable of similar magic.

How do I speed scrapy up? I really like the framework, but this performance issue could be a deal-breaker for me.

Here's the relevant part of the settings.py file. Is there some important setting I've missed?

LOG_ENABLED = False
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_IP = 8

A few parameters:

  • Using scrapy version 0.14
  • The project is deployed on an EC2 large instance, so there should be plenty of memory, CPU, and bandwidth to play with.
  • I'm scheduling crawls using the JSON protocol, keeping the crawler topped up with a few dozen concurrent crawls at any given time.
  • As I said at the beginning, I'm downloading pages from many sites, so remote server performance and CONCURRENT_REQUESTS_PER_IP shouldn't be a worry.
  • For the moment, I'm doing very little post-processing. No xpath; no regex; I'm just saving the url and a few basic statistics for each page. (This will change later once I get the basic performance kinks worked out.)

Answer

I had this problem in the past, and I solved a large part of it with a 'dirty' old trick.

Run a local caching DNS server.

When you see high CPU usage while accessing many remote sites simultaneously, it is mostly because Scrapy is busy resolving the hostnames in the URLs.

And please remember to change the DNS settings on the host (/etc/resolv.conf) to point to your LOCAL caching DNS server.

The first requests will be slow, but as soon as the cache warms up and resolution becomes more efficient, you are going to see HUGE improvements.
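
To see why caching helps, here is a minimal Python sketch of the same idea done in-process. It is not the local DNS server recommended above, only an illustration of the warm-up effect. The names make_cached_resolver and time_lookups are hypothetical helpers, not part of Scrapy, and the cache ignores DNS TTLs, which a real caching server would honor.

```python
import socket
import time
from functools import lru_cache


def make_cached_resolver(resolve=socket.gethostbyname, maxsize=10000):
    """Return a resolver that remembers answers in an in-memory LRU cache.

    `resolve` is any callable mapping a hostname to an address; it defaults
    to the blocking system lookup. Hypothetical helper, not a Scrapy API.
    """
    @lru_cache(maxsize=maxsize)
    def cached_resolve(hostname):
        # Only reached on a cache miss; repeated hostnames skip this call.
        return resolve(hostname)

    return cached_resolve


def time_lookups(resolver, hostname, repeats=3):
    """Time `repeats` consecutive lookups of one hostname, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        resolver(hostname)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    resolver = make_cached_resolver()
    # The first lookup goes to the system resolver; later ones hit the cache.
    for i, seconds in enumerate(time_lookups(resolver, "example.com"), 1):
        print(f"lookup {i}: {seconds * 1000:.3f} ms")
```

Run as a script, this prints one timing per lookup, so you can compare the first (uncached) lookup with the repeats. If the first lookup is much slower than the rest on your host, DNS resolution is a plausible bottleneck, and a local caching DNS server gives the same benefit to every process on the machine.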

I hope this helps with your problem!
