Question 1

I wrote a crawler in Python using Scrapy. Here is the code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    #from scrapy.item import Item
    from a11ypi.items import AYpiItem

    class AYpiSpider(CrawlSpider):
        name = "AYpi"
        allowed_domains = ["a11y.in"]
        start_urls = ["http://a11y.in/a11ypi/idea/firesafety.html"]

        rules =(Rule(SgmlLinkExtractor(allow = ()) ,callback = 'parse_item'))

        def parse_item(self,response):
            #filename = response.url.split("/")[-1]
            #open(filename,'wb').write(response.body)
            #testing codes ^ (the above)
            hxs = HtmlXPathSelector(response)
            item = AYpiItem()
            item["foruri"] = hxs.select("//@foruri").extract()
            item["thisurl"] = response.url
            item["thisid"] = hxs.select("//@foruri/../@id").extract()
            item["rec"] = hxs.select("//@foruri/../@rec").extract()
            return item

But instead of following the links, it throws this error:

    Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 131, in execute
        _run_print_help(parser, _run_command, cmd, args, opts)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 97, in _run_print_help
        func(*a, **kw)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
        cmd.run(args, opts)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/commands/crawl.py", line 45, in run
        q.append_spider_name(name, **opts.spargs)
    --- <exception caught here> ---
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/queue.py", line 89, in append_spider_name
        spider = self._spiders.create(name, **spider_kwargs)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/spidermanager.py", line 36, in create
        return self._spiders[spider_name](**spider_kwargs)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 38, in __init__
        self._compile_rules()
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 82, in _compile_rules
        self._rules = [copy.copy(r) for r in self.rules]
    exceptions.TypeError: 'Rule' object is not iterable

Can someone please explain what's going on? This is what the documentation describes, and since I leave the allow field blank, I expected follow to be True by default. So why the error? Also, what optimisations can I make to speed up my crawler?

Answer

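The traceback says it directly: rules is a single Rule object, not an iterable. It looks like you were trying to make rules a tuple, but parentheses alone do not create a tuple. (x) is just x, so a one-element tuple needs a trailing comma: (x,). It is worth reading up on tuples in the Python documentation.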

To fix your problem, change this line:

    rules =(Rule(SgmlLinkExtractor(allow = ()) ,callback = 'parse_item'))

To:

    rules =(Rule(SgmlLinkExtractor(allow = ()) ,callback = 'parse_item'),)

Notice the comma at the end?
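
If you want to see the difference without running Scrapy, here is a minimal sketch. The Rule class and compile_rules function are hypothetical stand-ins I wrote for illustration (they are not Scrapy's real code); compile_rules only imitates the list comprehension shown in your traceback.

```python
import copy


class Rule(object):
    """Hypothetical stand-in for scrapy's Rule, only here to show the tuple behaviour."""

    def __init__(self, callback=None):
        self.callback = callback


def compile_rules(rules):
    # Imitates the line from the traceback: iterate over rules and copy each one.
    return [copy.copy(r) for r in rules]


# Parentheses alone only group an expression, so this is a single Rule object.
not_a_tuple = (Rule(callback='parse_item'))

# The trailing comma is what makes this a one-element tuple.
one_tuple = (Rule(callback='parse_item'),)

try:
    compile_rules(not_a_tuple)
    error_message = None
except TypeError as exc:
    # Iterating over a plain Rule instance fails, as in the question.
    error_message = str(exc)

compiled = compile_rules(one_tuple)

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)
print(error_message)
print(len(compiled))
```

A list works just as well here (rules = [Rule(...)]) and avoids the easy-to-miss comma.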

Scrapy Crawler in python cannot follow links?
