Processing items with Scrapy pipeline

2024/10/13 13:22:05

I'm running Scrapy from a Python script.

I was told that in Scrapy, responses are built in parse() and further processed in pipelines.py.

This is how my framework is set up so far:

Python script

    def script(self):
        process = CrawlerProcess(get_project_settings())
        response = process.crawl('pitchfork_albums', domain='pitchfork.com')
        process.start()  # the script will block here until the crawling is finished

Spiders

    class PitchforkAlbums(scrapy.Spider):
        name = "pitchfork_albums"
        allowed_domains = ["pitchfork.com"]
        # creates objects for each URL listed here
        start_urls = [
            "http://pitchfork.com/reviews/best/albums/?page=1",
            "http://pitchfork.com/reviews/best/albums/?page=2",
            "http://pitchfork.com/reviews/best/albums/?page=3",
        ]

        def parse(self, response):
            for sel in response.xpath('//div[@class="album-artist"]'):
                item = PitchforkItem()
                item['artist'] = sel.xpath('//ul[@class="artist-list"]/li/text()').extract()
                item['album'] = sel.xpath('//h2[@class="title"]/text()').extract()
            yield item

items.py

    class PitchforkItem(scrapy.Item):
        artist = scrapy.Field()
        album = scrapy.Field()

settings.py

    ITEM_PIPELINES = {
        'blogs.pipelines.PitchforkPipeline': 300,
    }

pipelines.py

    class PitchforkPipeline(object):
        def __init__(self):
            self.file = open('tracks.jl', 'wb')

        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            for i in item:
                return i['album'][0]

If I just return item in pipelines.py, I get data like this (one response for each HTML page):

    {'album': [u'Sirens', u'I Had a Dream That You Were Mine', u'Sunergy',
               u'Skeleton Tree', u'My Woman', u'JEFFERY', u'Blonde / Endless',
               u' A Mulher do Fim do Mundo (The Woman at the End of the World) ',
               u'HEAVN', u'Blank Face LP', u'blackSUMMERS\u2019night',
               u'Wildflower', u'Freetown Sound', u'Trans Day of Revenge',
               u'Puberty 2', u'Light Upon the Lake', u'iiiDrops',
               u'Teens of Denial', u'Coloring Book', u'A Moon Shaped Pool',
               u'The Colour in Anything', u'Paradise', u'HOPELESSNESS',
               u'Lemonade'],
     'artist': [u'Nicolas Jaar', u'Hamilton Leithauser', u'Rostam',
                u'Kaitlyn Aurelia Smith', u'Suzanne Ciani',
                u'Nick Cave & the Bad Seeds', u'Angel Olsen', u'Young Thug',
                u'Frank Ocean', u'Elza Soares', u'Jamila Woods', u'Schoolboy Q',
                u'Maxwell', u'The Avalanches', u'Blood Orange', u'G.L.O.S.S.',
                u'Mitski', u'Whitney', u'Joey Purp', u'Car Seat Headrest',
                u'Chance the Rapper', u'Radiohead', u'James Blake',
                u'White Lung', u'ANOHNI', u'Beyonc\xe9']}

What I would like to do in pipelines.py is to be able to fetch individual songs for each item, like so:

    [u'Sirens']

Answer

I suggest that you build well-structured items in the spider. In the Scrapy workflow, the spider builds well-formed items (it parses the HTML and populates item instances), and the pipeline performs operations on those items (for example, filtering or storing them).
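
As a rough illustration of the pipeline side of that split, here is a minimal sketch that only stores items, assuming the spider already yields one item per album. The class name JsonLinesPipeline and the path parameter are made up for this example; open_spider, close_spider and process_item are the hooks Scrapy calls on a pipeline object.

```python
import json


class JsonLinesPipeline:
    """Write each item to a JSON Lines file, one item per line."""

    def __init__(self, path="tracks.jl"):
        # `path` is a hypothetical parameter added for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Text mode, because json.dumps returns str on Python 3.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Return the item itself so any later pipeline still receives it.
        return item
```

If each item describes a single album, then item['album'] inside process_item is already the individual title, so no loop over the item is needed there.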

For your application, if I understand correctly, each item should be one entry describing one album. So when parsing the HTML, you should build that kind of item instead of cramming everything on the page into a single item.

So in the parse function of your spider, you should:

  1. Put the yield item statement inside the for loop, NOT OUTSIDE. This way, each album generates its own item.
  2. Be careful with relative XPath selectors in Scrapy. To search the descendants of the current node, use .// instead of //; to select its direct children, use ./ instead of /. A path that starts with // always searches the whole document, even when called on a nested selector.
  3. Ideally the album title should be a scalar and the album artist a list, so use extract_first to get the album title as a scalar.

    def parse(self, response):
        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            # artist stays a list, since an album can have several artists
            item['artist'] = sel.xpath('./ul[@class="artist-list"]/li/text()').extract()
            # album title is a single value
            item['album'] = sel.xpath('./h2[@class="title"]/text()').extract_first()
            yield item
    
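To see the effect of relative paths without running a crawl, here is a small sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. ElementTree supports only a subset of XPath, and the markup below is a hypothetical, simplified fragment, so the real Pitchfork page may well be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page may differ.
PAGE = """
<html><body>
  <div class="album-artist">
    <ul class="artist-list"><li>Nicolas Jaar</li></ul>
    <h2 class="title">Sirens</h2>
  </div>
  <div class="album-artist">
    <ul class="artist-list"><li>Kaitlyn Aurelia Smith</li><li>Suzanne Ciani</li></ul>
    <h2 class="title">Sunergy</h2>
  </div>
</body></html>
"""


def parse_albums(root):
    """Yield one dict per album block, using paths relative to that block."""
    for sel in root.findall('.//div[@class="album-artist"]'):
        yield {
            # ./ looks only at children of the current album block
            "artist": [li.text for li in sel.findall('./ul[@class="artist-list"]/li')],
            "album": sel.findtext('./h2[@class="title"]'),
        }


root = ET.fromstring(PAGE.strip())
items = list(parse_albums(root))
for item in items:
    print(item)
```

Each dict holds one title and only the artists of that album. In Scrapy, a path starting with // on sel would instead search the whole page on every iteration, which is why the output in the question lumps every album on the page into one item.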

Hope this helps.

https://en.xdnf.cn/q/118075.html
