How to pass custom settings through CrawlerProcess in scrapy?

2024/11/15 18:14:27

I have two CrawlerProcess instances, each calling a different spider. I want to pass custom settings to one of these processes so that the spider's output is saved to CSV. I thought I could do this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

storage_settings = {'FEED_FORMAT': 'csv', 'FEED_URI': 'foo.csv'}
process = CrawlerProcess(get_project_settings())
process.crawl('ABC', crawl_links=main_links, custom_settings=storage_settings)
process.start()

and in my spider I read them as an argument:

    def __init__(self, crawl_links=None, allowed_domains=None, custom_settings=None, *args, **kwargs):
        self.start_urls = crawl_links
        self.allowed_domains = allowed_domains
        self.custom_settings = custom_settings
        self.rules = ......
        super(mySpider, self).__init__(*args, **kwargs)

But how can I tell my project settings file, settings.py, about these custom settings? I don't want to hard-code them; I want them to be read automatically.

Answer

You cannot tell your settings file about these settings. You may be confusing crawler settings with spider settings. In Scrapy, as of the time of this writing, the feed parameters need to be passed to the crawler process and not to the spider. I have the same use case as you. What you do is read the current project settings and then override them for each crawler process. Please see the example code below:

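from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Start from the project settings, then override them for this process only
s = get_project_settings()
s['FEED_FORMAT'] = 'csv'
s['LOG_LEVEL'] = 'INFO'
s['FEED_URI'] = 'Q1.csv'
s['LOG_FILE'] = 'Q1.log'
proc = CrawlerProcess(s)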

Also, your call to process.crawl() is not correct. The name of the spider should be passed as the first argument, as a string: process.crawl('MySpider', crawl_links=main_links). Here MySpider must be the value given to the name attribute in your spider class.
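
Putting the two points together, here is a minimal sketch of a per-process override, not a definitive implementation. The helper names build_overrides and run_spider, and the stem argument, are made up for illustration. It also assumes Scrapy 2.1 or later, where the FEEDS setting replaces the deprecated FEED_URI and FEED_FORMAT pair; on older versions you would set those two keys as in the snippet above.

```python
def build_overrides(stem, fmt="csv", log_level="INFO"):
    """Return the setting overrides for one crawl.

    `stem` is a hypothetical base name shared by the feed and the log file.
    Assumes Scrapy 2.1+, where FEEDS replaces FEED_URI and FEED_FORMAT.
    """
    return {
        "FEEDS": {f"{stem}.{fmt}": {"format": fmt}},
        "LOG_LEVEL": log_level,
        "LOG_FILE": f"{stem}.log",
    }


def run_spider(spider_name, overrides, **spider_kwargs):
    """Run one spider with the project settings plus `overrides`."""
    # Imported here so build_overrides() works without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    for name, value in overrides.items():
        settings.set(name, value)  # default priority is "project"

    process = CrawlerProcess(settings)
    # spider_name must match the `name` attribute of the spider class;
    # the keyword arguments are forwarded to the spider's __init__.
    process.crawl(spider_name, **spider_kwargs)
    process.start()  # blocks until the crawl finishes
```

For example, run_spider('MySpider', build_overrides('Q1'), crawl_links=main_links) would write the items to Q1.csv and the log to Q1.log, with main_links being the list from the question. Note that process.start() runs the Twisted reactor, which cannot be restarted, so a second crawl with different settings generally has to run in a separate Python process.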

