Scrapy is following and scraping non-allowed links

2024/9/24 14:25:14

I have a CrawlSpider set up to follow certain links and scrape a news magazine. The links to each issue follow this URL scheme: where YYYY is the year and DDDD is the three- or four-digit issue number.

I only want issues 928 onwards, and my rules are below. I don't have any problem connecting to the site, crawling links, or extracting items (so I haven't included the rest of my code). However, the spider seems determined to follow non-allowed links. It tries to scrape issues 377, 398, and more, and it follows the "culture.htm" and "feature.htm" links. This throws a lot of errors. It isn't terribly important, but it means a lot of data cleaning afterwards. Any suggestions as to what is going wrong?

class crawlerNameSpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = [""]
    start_urls = [""]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', )), follow=True),
        Rule(SgmlLinkExtractor(allow=('fr[0-9].htm', )), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('eg[0-9]*.htm', )), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('ec[0-9]*.htm', )), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('op[0-9]*.htm', )), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('sc[0-9]*.htm', )), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('re[0-9]*.htm', )), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('in[0-9]*.htm', )), callback='parse_item'),
        Rule(SgmlLinkExtractor(deny=('culture.htm', )), ),
        Rule(SgmlLinkExtractor(deny=('feature.htm', )), ),
    )

EDIT: I fixed this by using a much simpler regex for 2009, 2010, and 2011, but I am still curious why the above doesn't work, if anyone has any suggestions.


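You need to pass the deny arguments to the same SgmlLinkExtractor that collects the links to follow. A Rule whose extractor has only deny patterns matches every other link on the page, and because it has no callback it follows those links by default, which is how the old issues get crawled. You also don't need so many Rules when they all call the same parse_item function. I would write your code as: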

rules = (
    Rule(SgmlLinkExtractor(
            allow=('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', ),
            deny=('culture\.htm', 'feature\.htm'),
        ),
        follow=True),
    Rule(SgmlLinkExtractor(
            allow=(
                'fr[0-9].htm',
                'eg[0-9]*.htm',
                'ec[0-9]*.htm',
                'op[0-9]*.htm',
                'sc[0-9]*.htm',
                're[0-9]*.htm',
                'in[0-9]*.htm',
            )),
        callback='parse_item'),
)
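
To illustrate why allow and deny belong in the same extractor, here is a minimal sketch that models the filtering in plain Python. It is a simplified model, not Scrapy's actual implementation: it assumes a link is kept when it matches at least one allow pattern (or allow is empty) and matches no deny pattern. The helper link_allowed and the example.com URLs are hypothetical and only serve the illustration.

```python
import re


def link_allowed(url, allow=(), deny=()):
    """Simplified model of allow/deny filtering (an assumption, not Scrapy code).

    A URL is kept when it matches at least one allow pattern (or allow is
    empty) and matches none of the deny patterns.
    """
    if allow and not any(re.search(p, url) for p in allow):
        return False
    if any(re.search(p, url) for p in deny):
        return False
    return True


# Hypothetical URLs, for illustration only.
old_issue = "http://example.com/1998/377/index.htm"
new_issue = "http://example.com/2009/950/index.htm"
culture = "http://example.com/2009/950/culture.htm"

issue_pattern = r"\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm"

# A deny-only extractor keeps everything except the denied page,
# so the index page of an old issue passes straight through.
deny_only_old = link_allowed(old_issue, deny=(r"culture\.htm",))

# The second deny-only extractor (feature.htm) even keeps culture.htm.
feature_rule_culture = link_allowed(culture, deny=(r"feature\.htm",))

# With allow and deny combined in one extractor, only issues 928+ survive.
combined_old = link_allowed(
    old_issue, allow=(issue_pattern,), deny=(r"culture\.htm", r"feature\.htm")
)
combined_new = link_allowed(
    new_issue, allow=(issue_pattern,), deny=(r"culture\.htm", r"feature\.htm")
)
```

Under this model the issue-number regex itself rejects issue 377; it is the separate deny-only rules that let the old issues back in.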

If those are the actual URL patterns you use for parse_item, the second rule can be simplified to this:

    Rule(SgmlLinkExtractor(allow=('(fr|eg|ec|op|sc|re|in)[0-9]*\.htm', )), callback='parse_item'),
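
One thing worth checking with any of these item-page patterns: as far as I know the extractor applies them with search semantics, so an unanchored pattern can match in the middle of a file name. The sketch below tests the pattern against a few hypothetical paths (assuming item pages sit directly under the issue directory) and compares it with an anchored variant, which is my own suggestion rather than part of the original rules.

```python
import re

# The combined item-page pattern, unanchored.
ITEM_PAGE = re.compile(r"(fr|eg|ec|op|sc|re|in)[0-9]*\.htm")

# Assumed stricter variant: the prefix must directly follow a slash
# and ".htm" must end the URL.
ITEM_PAGE_ANCHORED = re.compile(r"/(fr|eg|ec|op|sc|re|in)[0-9]*\.htm$")

# Hypothetical paths, for illustration only.
samples = [
    "2009/950/fr1.htm",
    "2009/950/re12.htm",
    "2009/950/index.htm",
    "2009/950/culture.htm",
    "2009/950/feature.htm",
]

# "culture.htm" and "feature.htm" both end in "re.htm", so the
# unanchored pattern picks them up through its "re" alternative.
loose = [s for s in samples if ITEM_PAGE.search(s)]

# The anchored variant keeps only the real item pages.
strict = [s for s in samples if ITEM_PAGE_ANCHORED.search(s)]
```

If the real URLs behave like these samples, the 're[0-9]*.htm' rule would send culture.htm and feature.htm to parse_item on its own, so anchoring the pattern (or adding the deny patterns to this extractor too) may be needed.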

