Scrapy: nested URL data scraping

2024/10/15 3:11:38

I am working with the website https://www.grohe.com/in. On that site I want to scrape one category of bathroom faucets: https://www.grohe.com/in/25796/bathroom/bathroom-faucets/grandera/. That page lists multiple products and related products. I want to get each product URL and scrape the data behind it. This is what I wrote so far.

My items.py file looks like

from scrapy.item import Item, Field


class ScrapytestprojectItem(Item):
    producturl = Field()
    imageurl = Field()
    description = Field()

The spider code is:

import scrapy
from ScrapyTestProject.items import ScrapytestprojectItem


class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.com']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

    def parse(self, response):
        for divs in response.css('div.viewport div.workspace div.float-box'):
            item = {
                'producturl': divs.css('a::attr(href)').extract(),
                'imageurl': divs.css('a img::attr(src)').extract(),
                'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract(),
            }
            next_page = response.urljoin(item['producturl'])
            yield scrapy.Request(next_page, callback=self.parse, meta={'item': item})

When I ran **scrapy crawl nestedurl -o nestedurl.csv**, an empty file was created. The console output is:

2017-02-15 18:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-02-15 18:03:13 [scrapy] DEBUG: Crawled (200) <GET https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/> (referer: None)
2017-02-15 18:03:13 [scrapy] ERROR: Spider error processing <GET https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/pradeep/ScrapyTestProject/ScrapyTestProject/spiders/nestedurl.py", line 15, in parse
    next_page = response.urljoin(item['producturl'])
  File "/usr/lib/python2.7/dist-packages/scrapy/http/response/text.py", line 72, in urljoin
    return urljoin(get_base_url(self), url)
  File "/usr/lib/python2.7/urlparse.py", line 261, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 176, in urlsplit
    cached = _parse_cache.get(key, None)
TypeError: unhashable type: 'list'
2017-02-15 18:03:13 [scrapy] INFO: Closing spider (finished)
2017-02-15 18:03:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 253,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 31063,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 2, 15, 12, 33, 13, 396542),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2017, 2, 15, 12, 33, 11, 568424)}
2017-02-15 18:03:13 [scrapy] INFO: Spider closed (finished)

Answer

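I think the problem is that divs.css('a::attr(href)').extract() returns a list of all matching values rather than a single string. Passing that list to response.urljoin makes urlparse crash, because it uses the URL as a cache key and a list cannot be hashed. Taking a single value from the selector, for example with extract_first(), should avoid the error.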
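
To illustrate the point without needing Scrapy, here is a minimal sketch using only the standard library's urljoin (Python 3, so the error message differs from the Python 2 traceback above). The helper name join_first and the product path are hypothetical, made up for this example. It assumes the selector result is a plain list of strings, which is what extract() returns.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/"


def join_first(base_url, hrefs):
    """Join the first extracted href onto base_url.

    Returns None when nothing was extracted, so the caller can skip the row.
    (Hypothetical helper, not part of Scrapy.)
    """
    if not hrefs:
        return None
    return urljoin(base_url, hrefs[0])


# extract() hands back a list even when only one link matched.
hrefs = ["/in/12345/hypothetical-product/"]  # made-up path for illustration

try:
    urljoin(BASE_URL, hrefs)  # passing the list itself, as the spider does
except TypeError as exc:
    print("urljoin rejected the list:", exc)

# Passing a single string works as expected.
print(join_first(BASE_URL, hrefs))
```

In the spider itself, the equivalent change would be to build 'producturl' with extract_first() instead of extract(), and to skip the request when it returns None. Note also that the spider as posted never yields an item (the callback is parse again), so the CSV would likely stay empty until the callback that handles the product page yields the item from response.meta.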

