Scrapy is following and scraping non-allowed links

2024/9/24 14:25:14

I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the links to each issue use the following URL scheme:

http://example.com/YYYY/DDDD/index.htm where YYYY is the year and DDDD is the three or four digit issue number.

I only want issues 928 onwards, and have my rules below. I don't have any problem connecting to the site, crawling links, or extracting items (so I didn't include the rest of my code). The spider seems determined to follow non-allowed links. It is trying to scrape issues 377, 398, and more, and follows the "culture.htm" and "feature.htm" links. This throws a lot of errors and isn't terribly important but it requires a lot of cleaning of the data. Any suggestions as to what is going wrong?

    class crawlerNameSpider(CrawlSpider):
        name = 'crawler'
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/issues.htm"]

        rules = (
            Rule(SgmlLinkExtractor(allow=(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm',)), follow=True),
            Rule(SgmlLinkExtractor(allow=(r'fr[0-9].htm',)), callback='parse_item'),
            Rule(SgmlLinkExtractor(allow=(r'eg[0-9]*.htm',)), callback='parse_item'),
            Rule(SgmlLinkExtractor(allow=(r'ec[0-9]*.htm',)), callback='parse_item'),
            Rule(SgmlLinkExtractor(allow=(r'op[0-9]*.htm',)), callback='parse_item'),
            Rule(SgmlLinkExtractor(allow=(r'sc[0-9]*.htm',)), callback='parse_item'),
            Rule(SgmlLinkExtractor(allow=(r're[0-9]*.htm',)), callback='parse_item'),
            Rule(SgmlLinkExtractor(allow=(r'in[0-9]*.htm',)), callback='parse_item'),
            Rule(SgmlLinkExtractor(deny=(r'culture.htm',))),
            Rule(SgmlLinkExtractor(deny=(r'feature.htm',))),
        )

EDIT: I fixed this using a much simpler regex for 2009, 2010, 2011, but I am still curious why the above doesn't work, if anyone has any suggestions.

Answer

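The deny patterns have to go into the same SgmlLinkExtractor that collects the links to follow. A Rule whose extractor has only a deny argument extracts every link that does not match it, and a Rule without a callback follows links by default, so your last two rules follow almost everything. You also don't need so many Rules if they all call the same parse_item callback. I would write your code as: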

    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm',),
                deny=(r'culture\.htm', r'feature\.htm'),
            ),
            follow=True,
        ),
        Rule(
            SgmlLinkExtractor(
                allow=(
                    r'fr[0-9].htm', r'eg[0-9]*.htm', r'ec[0-9]*.htm', r'op[0-9]*.htm',
                    r'sc[0-9]*.htm', r're[0-9]*.htm', r'in[0-9]*.htm',
                ),
            ),
            callback='parse_item',
        ),
    )
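
The effect of putting deny next to allow can be checked without running a crawl. The sketch below is a simplified model of the extractor's filtering, not Scrapy's actual code: it assumes patterns are applied with re.search and that an empty allow tuple accepts every URL. The function name extracts and the sample URLs are made up for illustration.

```python
import re

def extracts(url, allow=(), deny=()):
    """Simplified model of allow/deny filtering (assumption: empty allow
    accepts every URL, and patterns are tested with re.search)."""
    if allow and not any(re.search(p, url) for p in allow):
        return False
    if any(re.search(p, url) for p in deny):
        return False
    return True

ISSUE = r'\d\d\d\d/(92[8-9]|9[3-9][0-9]|\d\d\d\d)/index\.htm'

# Hypothetical URLs following the scheme described in the question.
old_issue = "http://example.com/2001/377/index.htm"
new_issue = "http://example.com/2009/930/index.htm"
feature = "http://example.com/2009/930/feature.htm"

# Deny-only extractor, as in the question's second-to-last rule.
deny_only = dict(deny=(r'culture.htm',))
print(extracts(old_issue, **deny_only))  # True: old issues get through
print(extracts(feature, **deny_only))    # True: feature.htm gets through

# allow and deny on the same extractor, as in the rewritten first rule.
combined = dict(allow=(ISSUE,), deny=(r'culture\.htm', r'feature\.htm'))
print(extracts(old_issue, **combined))   # False
print(extracts(new_issue, **combined))   # True
print(extracts(feature, **combined))     # False
```

In this model the deny-only extractor accepts issue 377 and feature.htm, while the combined extractor accepts only index pages for issue 928 onwards.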

If these are the actual URL patterns you use for parse_item, the second rule can be simplified to this:

    Rule(SgmlLinkExtractor(allow=(r'(fr|eg|ec|op|sc|re|in)[0-9]*\.htm',)), callback='parse_item'),
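
One thing worth checking with these patterns: they are unanchored, so a prefix such as "re" can match in the middle of a file name. Both "culture.htm" and "feature.htm" end in "re.htm", which may be another reason those pages reach parse_item. The sketch below uses Python's re.search to compare the pattern above with an anchored variant. The anchored pattern and the sample URLs are assumptions for illustration, not part of the original answer.

```python
import re

# Pattern from the answer, plus a hypothetical anchored variant: the leading
# "/" makes the prefix start the file name, the trailing "$" makes ".htm"
# end the URL.
combined = r'(fr|eg|ec|op|sc|re|in)[0-9]*\.htm'
anchored = r'/(fr|eg|ec|op|sc|re|in)[0-9]*\.htm$'

# Hypothetical URLs; the real site's file names may differ.
urls = [
    "http://example.com/2009/930/fr1.htm",
    "http://example.com/2009/930/re2.htm",
    "http://example.com/2009/930/culture.htm",
    "http://example.com/2009/930/feature.htm",
]

# For each URL, record whether each pattern matches anywhere in it.
results = {
    url: (bool(re.search(combined, url)), bool(re.search(anchored, url)))
    for url in urls
}

for url, (loose, strict) in results.items():
    print(url, "combined:", loose, "anchored:", strict)
```

Here the unanchored pattern matches all four URLs, while the anchored one matches only fr1.htm and re2.htm. If the real URLs carry query strings or other suffixes, the "$" would need adjusting.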