Question 1

I have a CrawlSpider set up to following certain links and scrape a news magazine where the links to each issue follow the following URL scheme:

http://example.com/YYYY/DDDD/index.htm where YYYY is the year and DDDD is the three or four digit issue number.

I only want issues 928 onwards, and have my rules below. I don't have any problem connecting to the site, crawling links, or extracting items (so I didn't include the rest of my code). The spider seems determined to follow non-allowed links. It is trying to scrape issues 377, 398, and more, and follows the "culture.htm" and "feature.htm" links. This throws a lot of errors and isn't terribly important but it requires a lot of cleaning of the data. Any suggestions as to what is going wrong?

class crawlerNameSpider(CrawlSpider):
name = 'crawler'
allowed_domains = ["example.com"]
start_urls = ["http://example.com/issues.htm"]rules = (Rule(SgmlLinkExtractor(allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', )), follow = True),Rule(SgmlLinkExtractor(allow = ('fr[0-9].htm', )), callback = 'parse_item'),Rule(SgmlLinkExtractor(allow = ('eg[0-9]*.htm', )), callback = 'parse_item'),Rule(SgmlLinkExtractor(allow = ('ec[0-9]*.htm', )), callback = 'parse_item'),Rule(SgmlLinkExtractor(allow = ('op[0-9]*.htm', )), callback = 'parse_item'),Rule(SgmlLinkExtractor(allow = ('sc[0-9]*.htm', )), callback = 'parse_item'),Rule(SgmlLinkExtractor(allow = ('re[0-9]*.htm', )), callback = 'parse_item'),Rule(SgmlLinkExtractor(allow = ('in[0-9]*.htm', )), callback = 'parse_item'),Rule(SgmlLinkExtractor(deny = ('culture.htm', )), ),Rule(SgmlLinkExtractor(deny = ('feature.htm', )), ),)

EDIT: I fixed this using a much simpler regex fot 2009, 2010, 2011, but I am still curious why the above doesn't work if anyone has any suggestions.

Question 2

You need to pass deny arguments to SgmlLinkExtractor which collects links to follow. And you don't need to create so many Rule's if they call one function parse_item. I would write your code as:

rules = (Rule(SgmlLinkExtractor(allow = ('\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm', ),deny = ('culture\.htm', 'feature\.htm'),), follow = True),Rule(SgmlLinkExtractor(allow = ('fr[0-9].htm', 'eg[0-9]*.htm','ec[0-9]*.htm','op[0-9]*.htm','sc[0-9]*.htm','re[0-9]*.htm','in[0-9]*.htm',)), callback = 'parse_item',),)

If it's real url patterns in rules you are using to parse_item, it can be simplified to this:

 Rule(SgmlLinkExtractor(allow = ('(fr|eg|ec|op|sc|re|in)[0-9]*\.htm', ), callback = 'parse_item',),)

Scrapy is following and scraping non-allowed links

Related Q&A

Overriding virtual methods in PyGObject

How to find if two numbers are consecutive numbers in gray code sequence

How do I get data from selected points in an offline plotly python jupyter notebook?

Set background colour for a custom QWidget

Plotly: How to set up a color palette for a figure created with multiple traces?

Which implementation of OrderedDict should be used in python2.6?

Signal Handling in Windows

Python tkinter.filedialog askfolder interfering with clr

Run a function for each element in two lists in Pandas Dataframe Columns

cannot filter palette images error when doing a ImageEnhance.Sharpness()