How to properly use Rules and restrict_xpaths to crawl and parse URLs with Scrapy?

2024/9/16 23:11:59

I am trying to write a crawl spider that crawls the RSS feeds of a website and then parses the meta tags of each article.

The first RSS page lists the RSS categories. I managed to extract the links because each <a> tag sits inside a <td> tag. It looks like this:

    <tr><td class="xmlLink"><a href="">subject1</a></td></tr>
    <tr><td class="xmlLink"><a href="">subject2</a></td></tr>

Once you click that link, it brings you to the articles for that RSS category, which look like this:

    <li class="regularitem"><h4 class="itemtitle"><a href="">article1</a></h4></li>
    <li class="regularitem"><h4 class="itemtitle"><a href="">article2</a></h4></li>

As you can see, I can get the link with XPath again if I use the <h4> tag. I want my crawler to follow the link inside that tag and parse the meta tags for me.
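To sanity-check the two XPath expressions outside of Scrapy, here is a minimal sketch that uses only Python's standard library. The example.com URLs are made up for illustration (the real href values are omitted above), and the wrapping <table> and <ul> elements are assumptions added so that the snippets parse as XML. ElementTree supports only a subset of XPath, so the expressions are written relative to the root (`.//`) rather than as `//`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: same structure as the snippets above, with made-up hrefs.
CATEGORY_PAGE = """
<table>
  <tr><td class="xmlLink"><a href="http://example.com/rss/subject1">subject1</a></td></tr>
  <tr><td class="xmlLink"><a href="http://example.com/rss/subject2">subject2</a></td></tr>
</table>
"""

ARTICLE_PAGE = """
<ul>
  <li class="regularitem"><h4 class="itemtitle"><a href="http://example.com/article1">article1</a></h4></li>
  <li class="regularitem"><h4 class="itemtitle"><a href="http://example.com/article2">article2</a></h4></li>
</ul>
"""


def extract_links(markup, xpath):
    """Return the href of every <a> element matched by the given path."""
    root = ET.fromstring(markup.strip())
    return [a.get("href") for a in root.findall(xpath)]


# Links to the per-category RSS pages (first level of the crawl).
category_links = extract_links(CATEGORY_PAGE, ".//td[@class='xmlLink']/a")
# Links to the individual articles (second level of the crawl).
article_links = extract_links(ARTICLE_PAGE, ".//h4[@class='itemtitle']/a")

print(category_links)
print(article_links)
```

If both lists come back non-empty on markup like this, the expressions themselves select the right regions, which suggests looking at the spider code rather than at the XPath.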

Here is my crawler code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem

class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['']  # urls from which the spider will start crawling
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles'),
    ]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items

However, this is the output when I run the crawler:

DEBUG: Crawled (200) <GET http://> (referer:
DEBUG: Crawled (200) <GET http://> (referer:

What am I doing wrong here? I've been reading the documentation over and over again but I feel like I keep overlooking something. Any help would be appreciated.

EDIT: Added items.append(item), which I had forgotten in the original post.

EDIT 2: I've tried this as well, and it resulted in the same output:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request

class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['']  # urls from which the spider will start crawling
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'.*',], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')), follow=True),
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_link)

    def parse_link(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_again)

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items

You're returning an empty items list: each item has to be appended to items before the list is returned.
Alternatively, you can yield each item inside the loop and drop the list entirely.
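
The two options can be compared in plain Python, independent of Scrapy. This is only a sketch under stated assumptions: the function names are hypothetical, plain dicts stand in for exampleItem, and a list of (name, content) pairs stands in for the selector results from '//meta'.

```python
def parse_with_list(url, meta_pairs):
    """Collect every item in a list and return the list after the loop."""
    items = []
    for name, content in meta_pairs:
        item = {'link': url, 'meta_name': name, 'meta_value': content}
        items.append(item)  # without this line the function returns []
    return items


def parse_with_yield(url, meta_pairs):
    """Hand each item back as soon as it is built; no list is needed."""
    for name, content in meta_pairs:
        yield {'link': url, 'meta_name': name, 'meta_value': content}


if __name__ == '__main__':
    # Made-up sample data standing in for the <meta> tags of one article.
    pairs = [('description', 'An example article'), ('keywords', 'rss, scrapy')]
    url = 'http://example.com/article1'
    print(parse_with_list(url, pairs))
    print(list(parse_with_yield(url, pairs)))
```

Both versions produce the same items. In the spider, the same shape applies to parse_articles: either keep items.append(item) inside the loop, or replace the list with a yield item.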

