I am trying to write a CrawlSpider that crawls a website's RSS feeds and then parses the meta tags of each article.
The first RSS page displays the RSS categories. I managed to extract the category links because each <a> tag sits inside a <td class="xmlLink"> tag. It looks like this:
<tr><td class="xmlLink"><a href="http://feeds.example.com/subject1">subject1</a></td></tr>
<tr><td class="xmlLink"><a href="http://feeds.example.com/subject2">subject2</a></td></tr>
Once you click that link, it brings you to the articles for that RSS category, which look like this:
<li class="regularitem"><h4 class="itemtitle"><a href="http://example.com/article1">article1</a></h4></li>
<li class="regularitem"><h4 class="itemtitle"><a href="http://example.com/article2">article2</a></h4></li>
As you can see, I can get the link with XPath again if I use the <h4 class="itemtitle"> tag. I want my crawler to follow the link inside that tag and parse the article's meta tags for me.
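To make the two XPath expressions concrete, here is a minimal standalone sketch that runs them against the sample markup above. It uses the standard library's ElementTree rather than Scrapy's selectors, so it only illustrates what I expect each expression to match. The wrapping <table> and <ul> elements and the extract_links helper are assumptions added for this illustration; they are not part of the real pages or of my spider.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question. The outer <table> and <ul> wrappers are
# assumptions, added only so that each snippet has a single root element.
CATEGORY_PAGE = """
<table>
  <tr><td class="xmlLink"><a href="http://feeds.example.com/subject1">subject1</a></td></tr>
  <tr><td class="xmlLink"><a href="http://feeds.example.com/subject2">subject2</a></td></tr>
</table>
"""

ARTICLE_PAGE = """
<ul>
  <li class="regularitem"><h4 class="itemtitle"><a href="http://example.com/article1">article1</a></h4></li>
  <li class="regularitem"><h4 class="itemtitle"><a href="http://example.com/article2">article2</a></h4></li>
</ul>
"""


def extract_links(markup, xpath):
    """Hypothetical helper: return the href of every <a> matched by xpath."""
    root = ET.fromstring(markup.strip())
    return [a.get("href") for a in root.findall(xpath)]


# Same structure as the restrict_xpaths values in the spider, narrowed to the <a>.
category_links = extract_links(CATEGORY_PAGE, ".//td[@class='xmlLink']/a")
article_links = extract_links(ARTICLE_PAGE, ".//h4[@class='itemtitle']/a")

print(category_links)
print(article_links)
```

If the real pages match these snippets, the first expression should yield the two feed URLs and the second the two article URLs.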
Here is my crawler code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss']  # urls from which the spider will start crawling
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles'),
    ]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items
However, this is the output when I run the crawler:
DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)
What am I doing wrong here? I've been reading the documentation over and over again but I feel like I keep overlooking something. Any help would be appreciated.
EDIT: Added items.append(item), which I had forgotten in the original post.
EDIT 2: I've tried this as well, and it resulted in the same output:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss']  # urls from which the spider will start crawling
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')), follow=True),
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_link)

    def parse_link(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_again)

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items