Scrapy Images Downloading

2024/10/4 11:16:30

My spider runs without displaying any errors, but the images are not stored in the folder. Here are my Scrapy files:

Spider.py:

import scrapy
import re
import os
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem, ListResidentialItem


class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["someurl.com"]
    start_urls = [
        "someurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()[0]
            yield scrapy.Request(
                urlparse.urljoin(response.url, img_url),
                callback=self.parseBasicListingInfo,
                meta={'item': item}
            )

    def parseBasicListingInfo(item, response):
        item = response.request.meta['item']
        item = ListResidentialItem()
        try:
            image_urls = map(unicode.strip,
                             response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract())
            item['image_urls'] = [x for x in image_urls]
        except IndexError:
            item['image_urls'] = ''
        return item

settings.py:

from scrapy.settings.default_settings import ITEM_PIPELINES
from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'production'
SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'
ROBOTSTXT_OBEY = True
DEPTH_PRIORITY = 1
IMAGE_STORE = '/images'
CONCURRENT_REQUESTS = 250
DOWNLOAD_DELAY = 2
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300,
}

items.py

# -*- coding: utf-8 -*-
import scrapy


class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()

# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    pass

My pipelines.py file is empty; I'm not sure what I'm supposed to add to it.

Any help is greatly appreciated.

Answer

My Working end result:

spider.py:

import scrapy
import re
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem
from production.items import ImageItem


class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["url"]
    start_urls = ["startingurl.com"]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@idd="followclaslink"]/@href').extract()[0]
            yield scrapy.Request(
                urlparse.urljoin(response.url, img_url),
                callback=self.parseImages,
                meta={'item': item}
            )

    def parseImages(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url])
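One side note, since the spider imports `urlparse` at the top: that module exists only on Python 2. On Python 3 it was folded into `urllib.parse`, so the join in `parse()` would be written as below (the URLs here are made-up stand-ins for `response.url` and the extracted `@href`):

```python
# Python 3 equivalent of urlparse.urljoin used in parse()
from urllib.parse import urljoin

base = "https://example.com/listing/123"  # stands in for response.url
href = "/photos/456"                      # stands in for the extracted @href
print(urljoin(base, href))                # https://example.com/photos/456
```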

settings.py

BOT_NAME = 'production'
SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'
ROBOTSTXT_OBEY = True
IMAGES_STORE = '/Users/home/images'  # note: IMAGES_STORE, not IMAGE_STORE
DOWNLOAD_DELAY = 2
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
# Disable cookies (enabled by default)

items.py

# -*- coding: utf-8 -*-
import scrapy


class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()

# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
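Note that the settings above register the stock `scrapy.pipelines.images.ImagesPipeline`, so this `MyImagesPipeline` subclass is never actually invoked. A minimal sketch of the settings change that would enable it, assuming the project layout above:

```python
# Point ITEM_PIPELINES at the custom subclass instead of the stock pipeline
ITEM_PIPELINES = {'production.pipelines.MyImagesPipeline': 1}
```

Also, `item_completed()` assigns `item['image_paths']`, so `ImageItem` would need an `image_paths = scrapy.Field()` declaration as well; otherwise that assignment raises a `KeyError`.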