Scrapy: Extracting data from source and its links

2024/10/12 18:16:26

Edited question to link to original:

Scrapy getting data from links within table

From the link

I am trying to get the info from the main table as well as the data behind the two links in each table row. I managed to pull the data from one link, but my question is how to follow the other link as well and append its data to the same line.

    # texasdeath/items.py
    from scrapy import Item, Field


    class DeathItem(Item):
        firstName = Field()
        lastName = Field()
        Age = Field()
        Date = Field()
        Race = Field()
        County = Field()
        Message = Field()
        Passage = Field()


    # spider module
    from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

    import scrapy
    from texasdeath.items import DeathItem


    class DeathSpider(scrapy.Spider):
        name = "death"
        allowed_domains = [""]
        start_urls = [""]

        def parse(self, response):
            sites = response.xpath('//table/tbody/tr')
            for site in sites:
                item = DeathItem()
                item['firstName'] = site.xpath('td[5]/text()').extract()
                item['lastName'] = site.xpath('td[4]/text()').extract()
                item['Age'] = site.xpath('td[7]/text()').extract()
                item['Date'] = site.xpath('td[8]/text()').extract()
                item['Race'] = site.xpath('td[9]/text()').extract()
                item['County'] = site.xpath('td[10]/text()').extract()

                url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
                url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())
                if url.endswith("html"):
                    request = scrapy.Request(url, meta={"item": item, "url2": url2}, callback=self.parse_details)
                    yield request
                else:
                    yield item

        def parse_details(self, response):
            item = response.meta["item"]
            url2 = response.meta["url2"]
            item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
            request = scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
            return request

        def parse_details2(self, response):
            item = response.meta["item"]
            item['Passage'] = response.xpath("//p/text()").extract_first()
            return item

I understand how to pass arguments to a request through meta, but I am still unclear about the flow, and at this point I am unsure whether this is even possible. I have looked at several examples, including the ones below:

using scrapy extracting data inside links

How can i use multiple requests and pass items in between them in scrapy python

Ideally the output would mirror the main table, one line per row, with the data pulled from both of that row's links added to it.

Appreciate any help or direction.


The problem in this case is in this piece of code:

    if url.endswith("html"):
        yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
    else:
        yield item

    if url2.endswith("html"):
        yield scrapy.Request(url2, meta={"item": item}, callback=self.parse_details2)
    else:
        yield item

Each request you yield starts a new "thread" of processing that takes its own course, so parse_details won't be able to see what is being done in parse_details2, and the item gets yielded more than once. The way I would do it is to chain the calls, making the second request from inside the first callback, this way:

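    # Inside the `for site in sites:` loop of parse(); columns as in the question
    url = urljoin(response.url, site.xpath("td[3]/a/@href").extract_first())
    url2 = urljoin(response.url, site.xpath("td[2]/a/@href").extract_first())

    if url.endswith("html"):
        # Follow the first link and carry the second URL along with the item
        request = scrapy.Request(url, callback=self.parse_details)
        request.meta['item'] = item
        request.meta['url2'] = url2
        yield request
    elif url2.endswith("html"):
        # No first link: go straight to the second page
        request = scrapy.Request(url2, callback=self.parse_details2)
        request.meta['item'] = item
        yield request
    else:
        yield item

    # Callback for the first link
    def parse_details(self, response):
        item = response.meta["item"]
        url2 = response.meta["url2"]
        # Field name and XPath taken from the question's DeathItem and spider
        item['Message'] = response.xpath("//p[contains(text(), 'Last Statement')]/following-sibling::p/text()").extract()
        # urljoin always returns a non-empty string, so test the suffix instead
        if url2.endswith("html"):
            request = scrapy.Request(url2, callback=self.parse_details2)
            request.meta['item'] = item
            yield request
        else:
            yield item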

This code hasn't been tested thoroughly, so please comment with anything you run into as you test it.
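
If the flow is still unclear, it may help to see it outside of Scrapy. The following is only a toy simulation in plain Python, not Scrapy's real engine: `Request`, `crawl`, `PAGES`, the callback names `emit_message` and `emit_passage`, and the sample values are all made up for illustration. It assumes a single queue that runs one callback at a time and records every yielded item the moment it is yielded, roughly as an exporter would write a row.

```python
from collections import deque

# Hypothetical stand-ins for the two detail pages, keyed by URL.
PAGES = {
    "statement.html": "last words",
    "info.html": "offender info",
}


class Request:
    """Toy stand-in for scrapy.Request: a URL, a callback and a meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


def crawl(start_results):
    """Toy engine: queue requests, record items as soon as they are yielded."""
    queue = deque()
    scraped = []

    def handle(results):
        for result in results:
            if isinstance(result, Request):
                queue.append(result)
            else:
                scraped.append(dict(result))  # snapshot of the item right now

    handle(start_results)
    while queue:
        request = queue.popleft()
        handle(request.callback(PAGES[request.url], request.meta))
    return scraped


# Chained: each callback passes the item on, only the last one yields it.
def parse_details(body, meta):
    item = meta["item"]
    item["Message"] = body
    yield Request(meta["url2"], parse_details2, meta={"item": item})


def parse_details2(body, meta):
    item = meta["item"]
    item["Passage"] = body
    yield item


def parse_chained():
    item = {"lastName": "Doe"}
    yield Request("statement.html", parse_details,
                  meta={"item": item, "url2": "info.html"})


# Parallel: both requests are issued from parse(), each callback yields.
def emit_message(body, meta):
    item = meta["item"]
    item["Message"] = body
    yield item


def emit_passage(body, meta):
    item = meta["item"]
    item["Passage"] = body
    yield item


def parse_parallel():
    item = {"lastName": "Doe"}
    yield Request("statement.html", emit_message, meta={"item": item})
    yield Request("info.html", emit_passage, meta={"item": item})


chained = crawl(parse_chained())
parallel = crawl(parse_parallel())
print(chained)
print(parallel)
```

In this sketch the chained version records one line per row with both `Message` and `Passage` filled in, while the parallel version records the same row twice, and whichever callback runs first records it without the other page's field. That is the reason for making the second request from inside the first callback.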

