Scrapy: Extracting data from source and its links

2024/10/12 18:16:26

Edited question to link to original:

Scrapy getting data from links within table

From the link https://www.tdcj.state.tx.us/death_row/dr_info/trottiewillielast.html

I am trying to get info from the main table as well as the data behind the other two links within the table. I managed to pull data from one link, but my question is how to follow the other link as well and append its data to the same row.

    from urlparse import urljoin

    import scrapy
    from texasdeath.items import DeathItem


    class DeathItem(Item):
        firstName = Field()
        lastName = Field()
        Age = Field()
        Date = Field()
        Race = Field()
        County = Field()
        Message = Field()
        Passage = Field()


    class DeathSpider(scrapy.Spider):
        name = "death"
        allowed_domains = ["tdcj.state.tx.us"]
        start_urls = [
            "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
        ]

        def parse(self, response):
            sites = response.xpath('//table/tbody/tr')
            for site in sites:
                item = DeathItem()
                item['firstName'] = site.xpath('td[5]/text()').extract()
                item['lastName'] = site.xpath('td[4]/text()').extract()
                item['Age'] = site.xpath('td[7]/text()').extract()
                item['Date'] = site.xpath('td[8]/text()').extract()
                item['Race'] = site.xpath('td[9]/text()').extract()
                item['County'] = site.xpath('td[10]/text()').extract()

                url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
                url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
                if url.endswith("html"):
                    request = scrapy.Request(url, meta={"item": item, "url2": url2}, callback=self.parse_details)
                    yield request
                else:
                    yield item

        def parse_details(self, response):
            item = response.meta["item"]
            url2 = response.meta["url2"]
            item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
            request = scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
            return request

        def parse_details2(self, response):
            item = response.meta["item"]
            item['Passage'] = response.xpath("//p/text()").extract_first()
            return item

I understand how to pass arguments to a request through meta, but I am still unclear about the flow, and at this point I am unsure whether this is possible at all. I have viewed several examples, including the ones below:

using scrapy extracting data inside links

How can i use multiple requests and pass items in between them in scrapy python

The output should mirror the main table, with each row extended by the data found behind both of its links.

Appreciate any help or direction.

Answer

The problem in this case is in this piece of code:

    if url.endswith("html"):
        yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
    else:
        yield item
    if url2.endswith("html"):
        yield scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
    else:
        yield item

Each request you yield is scheduled on its own and runs its own callback, like a separate "thread" with its own course of life. So parse_details won't be able to see what is being done in parse_details2, and neither callback knows when the item is complete. The way I would do it is to call one from within the other, this way:

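                url = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
                url2 = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
                if url.endswith("html"):
                    # Follow the first link; carry the item and the second URL along.
                    request = scrapy.Request(url, callback=self.parse_details)
                    request.meta['item'] = item
                    request.meta['url2'] = url2
                    yield request
                elif url2.endswith("html"):
                    # Only the second link is usable: go straight to parse_details2.
                    request = scrapy.Request(url2, callback=self.parse_details2)
                    request.meta['item'] = item
                    yield request
                else:
                    yield item

        def parse_details(self, response):
            item = response.meta["item"]
            url2 = response.meta["url2"]
            # 'About Me' must be declared as a Field on the item class,
            # otherwise this assignment raises KeyError.
            item['About Me'] = response.xpath("//p[contains(text(), 'About Me')]/following-sibling::p/text()").extract()
            # Same check as in parse(): url2 is always a non-empty string here,
            # so a plain "if url2:" would never take the else branch.
            if url2.endswith("html"):
                request = scrapy.Request(url2, callback=self.parse_details2)
                request.meta['item'] = item
                yield request
            else:
                yield item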
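To make the flow easier to follow, here is a toy model of the chaining idea that runs without Scrapy or a network connection. It is only a sketch: `Request`, `crawl` and `PAGES` are hypothetical stand-ins I made up for illustration, not Scrapy's API, and the page contents and the "Doe" row are invented sample data. The point is just that the item is yielded once, by the last callback in the chain.

```python
from collections import deque

# Fake "pages" (made-up sample data): URL -> text a callback would extract.
PAGES = {
    "last.html": "last statement text",
    "info.html": "offender info text",
}


class Request:
    """Hypothetical stand-in for scrapy.Request: URL, callback and meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


def crawl(start_requests):
    """Run callbacks until no requests remain and collect the yielded items."""
    queue = deque(start_requests)
    items = []
    while queue:
        request = queue.popleft()
        body = PAGES[request.url]
        for result in request.callback(body, request.meta):
            if isinstance(result, Request):
                queue.append(result)   # a follow-up request: schedule it
            else:
                items.append(result)   # anything else counts as a finished item
    return items


def parse_details(body, meta):
    item = meta["item"]
    item["Message"] = body
    # Chain: pass the same item on to the second page instead of yielding it now.
    yield Request(meta["url2"], parse_details2, meta={"item": item})


def parse_details2(body, meta):
    item = meta["item"]
    item["Passage"] = body
    yield item  # the item is complete only at this point


row = {"lastName": "Doe"}
start = Request("last.html", parse_details, meta={"item": row, "url2": "info.html"})
items = crawl([start])
print(items)
```

Each table row ends up as exactly one item carrying both `Message` and `Passage`, which is the "one line" the question asks for.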

This code hasn't been tested thoroughly, so please comment as you test.
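One thing to watch while testing is the link check itself. As far as I can tell, when a cell has no link, `extract_first()` gives `None` and `urljoin` then hands back the page's own URL, which also ends in "html". A small helper can rule that out before building the request. This is a sketch assuming Python 3 (where `urljoin` lives in `urllib.parse`); `detail_url` is a name I made up, and the hrefs below are assumed sample values, not copied from the real table.

```python
from urllib.parse import urljoin  # Python 3; on Python 2 it was urlparse.urljoin


def detail_url(base_url, href):
    """Return an absolute .html URL for href, or None if there is nothing to follow."""
    if not href:
        return None  # the cell had no link at all
    url = urljoin(base_url, href)
    return url if url.endswith(".html") else None


BASE = "http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"

print(detail_url(BASE, "dr_info/trottiewillielast.html"))
print(detail_url(BASE, "dr_info/example.jpg"))  # not an HTML page: skipped
print(detail_url(BASE, None))                   # missing link: skipped
```

In the spider, the result can then be tested with a plain `if url:` before yielding the request.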

https://en.xdnf.cn/q/118171.html
