Scrapy crawler in Python cannot follow links?

2024/10/3 23:24:35

I wrote a crawler in Python using Scrapy. Here is the code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    #from scrapy.item import Item
    from a11ypi.items import AYpiItem

    class AYpiSpider(CrawlSpider):
        name = "AYpi"
        allowed_domains = ["a11y.in"]
        start_urls = ["http://a11y.in/a11ypi/idea/firesafety.html"]

        rules =(Rule(SgmlLinkExtractor(allow = ()) ,callback = 'parse_item'))

        def parse_item(self,response):
            #filename = response.url.split("/")[-1]
            #open(filename,'wb').write(response.body)
            #testing codes ^ (the above)

            hxs = HtmlXPathSelector(response)
            item = AYpiItem()
            item["foruri"] = hxs.select("//@foruri").extract()
            item["thisurl"] = response.url
            item["thisid"] = hxs.select("//@foruri/../@id").extract()
            item["rec"] = hxs.select("//@foruri/../@rec").extract()
            return item

But instead of following the links, the crawler throws this error:

    Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 131, in execute
        _run_print_help(parser, _run_command, cmd, args, opts)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 97, in _run_print_help
        func(*a, **kw)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
        cmd.run(args, opts)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/commands/crawl.py", line 45, in run
        q.append_spider_name(name, **opts.spargs)
    --- <exception caught here> ---
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/queue.py", line 89, in append_spider_name
        spider = self._spiders.create(name, **spider_kwargs)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/spidermanager.py", line 36, in create
        return self._spiders[spider_name](**spider_kwargs)
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 38, in __init__
        self._compile_rules()
      File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 82, in _compile_rules
        self._rules = [copy.copy(r) for r in self.rules]
    exceptions.TypeError: 'Rule' object is not iterable

Can someone please explain what is going on? This follows what the documentation describes, and since I leave the allow field blank, I expected follow to be True by default. So why the error? Also, what optimisations can I make to speed up my crawler?

Answer

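From what I see, your rules attribute is not an iterable: it holds a single Rule object. It looks like you were trying to make rules a tuple, but parentheses around a single item only group the expression; it is the comma that creates a tuple. The section on tuples in the Python documentation covers this.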
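As a quick illustration of that tuple syntax in plain Python (no Scrapy needed; the variable names here are made up for the example):

```python
# Parentheses alone only group an expression; the comma is what builds a tuple.
single = ("parse_item")       # still a plain string
one_tuple = ("parse_item",)   # a tuple containing one string

print(type(single).__name__)     # str
print(type(one_tuple).__name__)  # tuple
print(len(one_tuple))            # 1
```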

To fix your problem, change this line:

    rules =(Rule(SgmlLinkExtractor(allow = ()) ,callback = 'parse_item'))

To:

    rules =(Rule(SgmlLinkExtractor(allow = ()) ,callback = 'parse_item'),)

Notice the comma at the end?
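To see why the missing comma produces exactly this TypeError, here is a minimal sketch that imitates the failing line from the traceback. The Rule class and compile_rules function below are hypothetical stand-ins written for this example, not Scrapy's own code; only the list comprehension is taken from the traceback.

```python
import copy


class Rule(object):
    """Hypothetical stand-in for Scrapy's Rule, for illustration only."""

    def __init__(self, callback=None):
        self.callback = callback


def compile_rules(rules):
    # Same shape as the line in the traceback: iterate over rules, copy each.
    return [copy.copy(r) for r in rules]


broken = (Rule(callback="parse_item"))   # no comma: just a Rule object
fixed = (Rule(callback="parse_item"),)   # trailing comma: a one-element tuple

try:
    compile_rules(broken)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

compiled = compile_rules(fixed)

print(error_message)   # 'Rule' object is not iterable
print(len(compiled))   # 1
```

Writing rules as a list, for example [Rule(...)], should avoid the problem in the same way, since the traceback shows the rules only being iterated over.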
