How to properly use Rules and restrict_xpaths to crawl and parse URLs with Scrapy?


I am trying to write a crawl spider that crawls the RSS feeds of a website and then parses the meta tags of each article.

The first RSS page is a page that displays the RSS categories. I managed to extract the links because each one is an <a> tag inside a <td class="xmlLink"> cell. It looks like this:

        <tr><td class="xmlLink"><a href="http://feeds.example.com/subject1">subject1</a></td></tr>
        <tr><td class="xmlLink"><a href="http://feeds.example.com/subject2">subject2</a></td></tr>

Once you click that link, it brings you to the articles for that RSS category, which look like this:

   <li class="regularitem"><h4 class="itemtitle"><a href="http://example.com/article1">article1</a></h4></li>
   <li class="regularitem"><h4 class="itemtitle"><a href="http://example.com/article2">article2</a></h4></li>

As you can see, I can get the link with XPath again if I use the <h4 class="itemtitle"> tag. I want my crawler to follow the link inside that tag and parse the meta tags of the article for me.
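For reference, these are the selectors I have in mind for the two pages (a quick sketch from the scrapy shell, assuming the markup shown above; hxs is the HtmlXPathSelector the shell provides):

# On the RSS category page:
hxs.select('//td[@class="xmlLink"]/a/@href').extract()
# [u'http://feeds.example.com/subject1', u'http://feeds.example.com/subject2']

# On one category's feed page:
hxs.select('//h4[@class="itemtitle"]/a/@href').extract()
# [u'http://example.com/article1', u'http://example.com/article2']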

Here is my crawler code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    # URLs from which the spider will start crawling
    start_urls = ['http://example.com/tools/rss']
    rules = [
        # Follow the links to each RSS category page
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        # Parse the article links found on each category page
        Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles'),
    ]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items

However this is the output when I run the crawler:

DEBUG: Crawled (200) <GET http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
DEBUG: Crawled (200) <GET http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)

What am I doing wrong here? I've been reading the documentation over and over again but I feel like I keep overlooking something. Any help would be appreciated.

EDIT: Added items.append(item); I had forgotten it in the original post. EDIT: I've also tried the following, and it resulted in the same output:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    # URLs from which the spider will start crawling
    start_urls = ['http://example.com/tools/rss']
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')), follow=True),
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_link)

    def parse_link(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_again)

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items
Answer

You've returned an empty items list; you need to append each item to items.
Alternatively, you can yield item inside the loop.
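For example, a minimal sketch of the corrected callback (the rest of the spider stays the same; yielding each item as it is built is equivalent to appending it to items and returning the list):

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        for m in hxs.select('//meta'):
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            yield item  # yield each populated item instead of returning an empty list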
