python scrapy not crawling all urls in scraped list

2024/11/18 4:28:55

I am trying to scrape information from the film pages listed on this page: https://pardo.ch/pardo/program/archive/2017/catalog-films.html

The XPath selector:

film_page_urls_startpage = sel.xpath('//article[@class="strip-list_link_all strip-list strip--color row row--5"]/a/@href').extract()

correctly scrapes all 23 URLs. However, the spider doesn't even appear to try crawling all 23. It crawls only 11, and the same 11 each time. Since I'm using Selenium, I can see it jump right over the first page/URL without ever navigating to it at all. What gives?

This is my code:

from scrapy import Spider
from scrapy.http import Request
from selenium import webdriver
from scrapy.selector import Selector
from time import sleep
from selenium.common.exceptions import NoSuchElementException
from scrapy.loader import ItemLoader
from films_locarno.items import FilmsLocarnoItem


class FilmsLocarnoSpiderSpider(Spider):
    name = 'films_locarno_spider'
    allowed_domains = ['https://pardo.ch/']
    start_urls = ['https://pardo.ch/pardo/program/archive/2017/catalog-films.html']

    def start_requests(self):
        self.driver = webdriver.Firefox()
        self.driver.get('https://pardo.ch/pardo/program/archive/2017/catalog-films.html')
        sel = Selector(text=self.driver.page_source)

        # grab list of start pages for all 4/5 editions of festival available
        # list of film page urls on start page (letter A)
        film_page_urls_startpage = sel.xpath('//article[@class="strip-list_link_all strip-list strip--color row row--5"]/a/@href').extract()
        film_page_urls_startpage_full = []
        for url in film_page_urls_startpage:
            film_page_fullurl = "https://pardo.ch" + url
            film_page_urls_startpage_full.append(film_page_fullurl)

        # navigate to startpage film_pages
        for url3 in film_page_urls_startpage_full:
            self.driver.get(url3)
            sel = Selector(text=self.driver.page_source)
            self.logger.info('Sleeping for 1 second')
            sleep(1)
            yield Request(url3, callback=self.parse_filmpage)
            self.logger.info('Sleeping for 2 seconds')
            sleep(2)

My output log reads (you can ignore the ERROR; it's only a page navigation error, since fixed):

2017-12-26 09:29:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: films_locarno)
2017-12-26 09:29:33 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['films_locarno.spiders'], 'BOT_NAME': 'films_locarno', 'NEWSPIDER_MODULE': 'films_locarno.spiders', 'FEED_URI': 'films_locarno6.csv', 'FEED_FORMAT': 'csv'}
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage','scrapy.extensions.corestats.CoreStats','scrapy.extensions.logstats.LogStats','scrapy.extensions.telnet.TelnetConsole','scrapy.extensions.feedexport.FeedExporter']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware','scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware','scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware','scrapy.downloadermiddlewares.useragent.UserAgentMiddleware','scrapy.downloadermiddlewares.retry.RetryMiddleware','scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware','scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware','scrapy.downloadermiddlewares.redirect.RedirectMiddleware','scrapy.downloadermiddlewares.cookies.CookiesMiddleware','scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware','scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware','scrapy.spidermiddlewares.offsite.OffsiteMiddleware','scrapy.spidermiddlewares.referer.RefererMiddleware','scrapy.spidermiddlewares.urllength.UrlLengthMiddleware','scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-26 09:29:33 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline']
2017-12-26 09:29:33 [scrapy.core.engine] INFO: Spider opened
2017-12-26 09:29:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-26 09:29:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-12-26 09:29:34 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session {"capabilities": {"firstMatch": [], "alwaysMatch": {"browserName": "firefox", "acceptInsecureCerts": true}}, "desiredCapabilities": {"browserName": "firefox", "acceptInsecureCerts": true}}
2017-12-26 09:29:41 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:41 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/catalog-films.html"}
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:52 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70"}
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:29:56 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:29:56 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:29:57 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:29:59 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70"}
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:03 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:03 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:04 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:06 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=968681&eid=70"}
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:09 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:09 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:10 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:12 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=959475&eid=70"}
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:14 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:14 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:15 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:17 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960897&eid=70"}
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:19 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:19 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:20 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:22 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960706&eid=70"}
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:25 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:25 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:26 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:28 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=929220&eid=70"}
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:32 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:32 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:33 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:35 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960742&eid=70"}
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:38 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:38 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-26 09:30:39 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:41 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=960703&eid=70"}
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:44 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:44 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:45 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:47 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=963699&eid=70"}
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:50 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:50 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70> (referer: None)
2017-12-26 09:30:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70> (referer: None)
2017-12-26 09:30:51 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:30:54 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch/pardo/program/archive/2017/film.html?fid=964462&eid=70"}
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/source {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a"}
2017-12-26 09:30:58 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:30:58 [films_locarno_spider] INFO: Sleeping for 1 second
2017-12-26 09:30:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=968681&eid=70> (referer: None)
2017-12-26 09:30:59 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:02 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:05 [films_locarno_spider] INFO: Sleeping for 2 seconds
2017-12-26 09:31:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54941/session/1a43ebe2-5161-ba45-acd6-31534994c97a/url {"sessionId": "1a43ebe2-5161-ba45-acd6-31534994c97a", "url": "https://pardo.ch<a href=\"?finit=B\" class=\"dd__list__link\">B</a>"}
2017-12-26 09:31:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2017-12-26 09:31:07 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/Users/MNK1/Desktop/films_locarno/films_locarno/spiders/films_locarno_spider.py", line 48, in start_requests
    self.driver.get(films_list_page)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 268, in get
    self.execute(Command.GET, {'url': url})
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Malformed URL: https://pardo.ch<a href="?finit=B" class="dd__list__link">B</a> is not a valid URL.
2017-12-26 09:31:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=959475&eid=70> (referer: None)
2017-12-26 09:31:07 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:10 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296> referred in <None>
2017-12-26 09:31:10 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296> referred in <None>
2017-12-26 09:31:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=960897&eid=70> (referer: None)
2017-12-26 09:31:10 [films_locarno_spider] INFO: Sleeping for 3 seconds
2017-12-26 09:31:13 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F430%2FOC973705_P3001_240430.jpg&w=539&h=296> referred in <None>
2017-12-26 09:31:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://pardo.ch/pardo/program/archive/2017/film.html?fid=960706&eid=70> (referer: None)
2017-12-26 09:31:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pardo.ch/pardo/program/archive/2017/film.html?fid=955449&eid=70>
{'color': ['Color'],'country': ['Pakistan, USA'],'director': [''],'festival_edition': ['70th'],'festival_year': ['2017'],'film_year': ['2015'],'format_': ['DCP'],'image_urls': ['https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296'],'images': [{'checksum': '89dd9751e436eed7ae35f980c2e10bc3','path': 'full/53cb39b642dcd6cea1e7898c9dc4777b844ea4fd.jpg','url': 'https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F104%2FOC956584_P3001_233104.jpeg&w=539&h=296'}],'language': ['Urdu'],'length': ["40'"],'program': ['Open Doors: Screenings'],'title': ['A Girl in the River: The Price of Forgiveness']}
2017-12-26 09:31:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pardo.ch/pardo/program/archive/2017/film.html?fid=959423&eid=70>
{'color': ['Color'],'country': ['Switzerland'],'director': [''],'festival_edition': ['70th'],'festival_year': ['2017'],'film_year': ['2017'],'format_': ['DCP'],'image_urls': ['https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296'],'images': [{'checksum': 'cce5e9ffd3bad2b359c489ac4c51c25e','path': 'full/84e0d100fc90acf2c0cfe8c38454a305e23b7408.jpg','url': 'https://pardo.ch:443/mirror/get.do?q=80&url=http%3A%2F%2Fwebfiles.pardo.ch%2Fperm%2F3001%2F970%2FOC960622_P3001_233970.jpg&w=539&h=296'}],
[[edited for length]]
2017-12-26 09:31:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3038,'downloader/request_count': 11,'downloader/request_method_count/GET': 11,'downloader/response_bytes': 115519,'downloader/response_count': 11,'downloader/response_status_count/200': 11,'file_count': 11,'file_status_count/uptodate': 11,'finish_reason': 'finished','finish_time': datetime.datetime(2017, 12, 26, 17, 31, 35, 820684),'item_scraped_count': 11,'log_count/DEBUG': 86,'log_count/ERROR': 1,'log_count/INFO': 43,'memusage/max': 79556608,'memusage/startup': 66007040,'response_received_count': 11,'scheduler/dequeued': 11,'scheduler/dequeued/memory': 11,'scheduler/enqueued': 11,'scheduler/enqueued/memory': 11,'start_time': datetime.datetime(2017, 12, 26, 17, 29, 33, 860768)}
2017-12-26 09:31:35 [scrapy.core.engine] INFO: Spider closed (finished)
Answer

I checked the length of your list:

len(film_page_urls_startpage)

and I get only 11, not 23.

If I use sel.xpath('//article/a/@href') instead, then I get all 23 URLs.

There is no need to filter on @class, because the page has no other article elements.


EDIT:

If I do

for item in sel.xpath('//article/@class').extract():
    print('class:', item)

then I get

class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even
class: strip-list_link_all strip-list strip--color row row--5
class: strip-list_link_all strip-list strip--color row row--5 even

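So every other article has an extra "even" token at the end of its class string, and this was your problem. A predicate like @class="..." compares the whole attribute value, so it matches only the 11 articles without "even" and skips the other 12, including the first one on the page.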
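
To make the difference concrete, here is a small sketch that reproduces the effect. It is not your spider: it uses the standard-library xml.etree.ElementTree on a made-up 23-row snippet (the markup, the fid values, and the variable names are all assumptions for illustration), but the exact-match behaviour of the @class predicate is the same idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the catalog page: 23 rows, where every
# other <article> carries an extra "even" class token (as in the output above).
BASE = "strip-list_link_all strip-list strip--color row row--5"
rows = []
for i in range(23):
    cls = BASE + " even" if i % 2 == 0 else BASE
    rows.append('<article class="{}"><a href="/film.html?fid={}"/></article>'.format(cls, i))
root = ET.fromstring("<div>" + "".join(rows) + "</div>")

# Exact match on the full attribute value: rows with the extra "even" token are skipped.
exact = root.findall(".//article[@class='{}']/a".format(BASE))

# No class predicate at all: every article link is found.
all_links = root.findall(".//article/a")

# Token-based filter done in Python: keeps any article whose class list contains "strip-list".
by_token = [
    a
    for article in root.iter("article")
    if "strip-list" in article.get("class", "").split()
    for a in article.findall("a")
]

exact_hrefs = [a.get("href") for a in exact]
all_hrefs = [a.get("href") for a in all_links]
token_hrefs = [a.get("href") for a in by_token]
print(len(exact_hrefs), len(all_hrefs), len(token_hrefs))  # 11 23 23
```

If you ever do need to filter by class in Scrapy, matching on a single class token is usually safer than matching the whole string, for example with a CSS selector such as sel.css('article.strip-list > a::attr(href)'). On this page, though, plain //article/a/@href should be enough.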
