Scrapy shell returns without a response

2024/11/15 20:53:22

I have a small problem crawling a website with Scrapy. I followed the Scrapy tutorial to learn how to crawl a website and wanted to test it on 'https://www.leboncoin.fr', but the spider doesn't work. So I tried:

scrapy shell 'https://www.leboncoin.fr'

But I don't get a response from the site.

$ scrapy shell 'https://www.leboncoin.fr'
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: all_cote)
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'all_cote', 'DUPEFILTER_CLASS':    'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0,   'NEWSPIDER_MODULE': 'all_cote.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['all_cote.spiders']}
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware','scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled item pipelines:[]
2017-05-16 08:31:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-16 08:31:27 [scrapy.core.engine] INFO: Spider opened
2017-05-16 08:31:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leboncoin.fr/robots.txt> (referer: None)
2017-05-16 08:31:27 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1039fbd30>
[s]   item       {}
[s]   request    <GET https://www.leboncoin.fr>
[s]   settings   <scrapy.settings.Settings object at 0x10716b8d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

If I run:

view(response)

an AttributeError is raised:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-2c2544195c90> in <module>()
----> 1 view(response)

/usr/local/lib/python3.6/site-packages/scrapy/utils/response.py in open_in_browser(response, _openfunc)
     67     from scrapy.http import HtmlResponse, TextResponse
     68     # XXX: this implementation is a bit dirty and could be improved
---> 69     body = response.body
     70     if isinstance(response, HtmlResponse):
     71         if b'<base' not in body:

AttributeError: 'NoneType' object has no attribute 'body'

Edit 1:

To rrschmidt: I have updated the post with the complete log. When I run

fetch('https://www.leboncoin.fr')

I get this:

2017-05-16 08:33:15 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>

So, how can I fix it?

Thanks for your answers,

Chris

Answer

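It looks like the website disallows scraping of that URL in its robots.txt (see the "Forbidden by robots.txt" line in your log). Because the request is blocked before it is downloaded, the shell never gets a response object, which is why view(response) fails with 'NoneType' object has no attribute 'body'. It's usually polite to respect the site's wish.

But if you really want to scrape the site, you can stop Scrapy from obeying robots.txt. Your project currently has ROBOTSTXT_OBEY set to True (see "Overridden settings" in the log); set it to False in your settings.py: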

ROBOTSTXT_OBEY=False
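
To see why the request was refused, here is a minimal sketch of the kind of check a robots.txt-aware crawler performs. It uses Python's standard urllib.robotparser, not Scrapy's own middleware code. The robots.txt contents and the is_allowed helper are hypothetical and are not the actual rules published by leboncoin.fr; only the bot name all_cote is taken from the log above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
# These are NOT the real rules served by leboncoin.fr.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""

BLOCK_ONE_FOLDER = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.leboncoin.fr/"
    # A blanket "Disallow: /" blocks every path for every user agent.
    print(is_allowed(BLOCK_EVERYTHING, "all_cote", url))
    # A narrower rule only blocks URLs under /private/.
    print(is_allowed(BLOCK_ONE_FOLDER, "all_cote", url))
```

With ROBOTSTXT_OBEY set to True, Scrapy makes an equivalent decision and drops the request when the answer is no. If you only want to disable the check for a single shell session rather than for the whole project, the setting can also be overridden on the command line with the -s option, for example scrapy shell -s ROBOTSTXT_OBEY=False 'https://www.leboncoin.fr'.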