Question 1

I have a little problem with scrapy to crawl a website. I followed the tutorial of scrapy to learn how crawl a website and I was interested to test it on the site 'https://www.leboncoin.fr' but the spider doesn't work. So, I tried :

scrapy shell 'https://www.leboncoin.fr'

But, I haven't a response of the site.

$ scrapy shell 'https://www.leboncoin.fr'
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: all_cote)
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'all_cote', 'DUPEFILTER_CLASS':    'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0,   'NEWSPIDER_MODULE': 'all_cote.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['all_cote.spiders']}
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats','scrapy.extensions.telnet.TelnetConsole']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware','scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled item pipelines:[]
2017-05-16 08:31:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-16 08:31:27 [scrapy.core.engine] INFO: Spider opened
2017-05-16 08:31:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leboncoin.fr/robots.txt> (referer: None)
2017-05-16 08:31:27 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1039fbd30>
[s]   item       {}
[s]   request    <GET https://www.leboncoin.fr>
[s]   settings   <scrapy.settings.Settings object at 0x10716b8d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

If I use :

view(response)

An AttributeError is printed...

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-2c2544195c90> in <module>()
----> 1 view(response)/usr/local/lib/python3.6/site-packages/scrapy/utils/response.py in open_in_browser(response, _openfunc)67     from scrapy.http import HtmlResponse, TextResponse68     # XXX: this implementation is a bit dirty and could be improved
---> 69     body = response.body70     if isinstance(response, HtmlResponse):71         if b'<base' not in body:

AttributeError: 'NoneType' object has no attribute 'body'

Edit 1 :

To rrschmidt : the complete log was updated and when I run

fetch('https:www.leboncoin.fr')

I receive this :

2017-05-16 08:33:15 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>

So, How can I fix it ?

Thanks for your answers,

Chris

Question 2

It looks like the website has restricted scraping via robots.txt. Its usually polite to respect that wish.

But if you really want to scrape the site you can change scrapy's default behaviour by changing the ROBOTSTXT_OBEY setting to false in your settings.py

ROBOTSTXT_OBEY=False

Scrapy shell return without response

AttributeError: 'NoneType' object has no attribute 'body'

Edit 1 :

Related Q&A

How to replace values using list comprehension in python3?

Installed PySide but cant import it: no module named PySide

How to run SQLAlchemy on AWS Lambda in Python

saving csv file to s3 using boto3

httplib2, how to set more than one cookie?

FTP upload file works manually, but fails using Python ftplib

Baktracking function which calculates change exceeds maximum recursion depth

How to interface a NumPy complex array with C function using ctypes?

How to access predefined environment variables in conda environment.yml?

Python enclosing scope variables with lambda function