Scraping Project Euler site with scrapy [closed]

2024/7/7 6:00:30

I'm trying to scrape projecteuler.net with Python's Scrapy library, just to get some practice with it. I've seen more than one existing implementation of such a scraper online, but they seem far too elaborate for my purpose. I simply want to save the problems (titles, ids, contents) in a JSON file and then load them with AJAX into a local web page on my PC.

I'm implementing my own solution, which I will finish anyway, but since I want to discover the smartest way to use the library, I'm asking you to propose the most sensible Scrapy programs for this job (if you want to skip the JSON step and save directly as HTML, that may be even better for me).

This is my first approach (doesn't work):

# -*- coding: utf-8 -*-
import httplib2
import requests
import scrapy
from eulerscraper.items import Problem
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule


def start_urls_detection():
    # su = ['https://projecteuler.net/archives', 'https://projecteuler.net/archives;page=2']
    # i = 1
    #
    # while True:
    #     request = requests.get(su[i])
    #
    #     if request.status_code != 200:
    #         break
    #
    #     i += 1
    #     su.append('https://projecteuler.net/archives;page=' + str(i + 1))
    return ["https://projecteuler.net/"]


class EulerSpider(CrawlSpider):
    name = 'euler'
    allowed_domains = ['projecteuler.net']
    start_urls = start_urls_detection()

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),
        Rule(LinkExtractor(allow=('problem=\d*',)), callback="parse_problems"),
        Rule(LinkExtractor(allow=('archives;page=\d*',), unique=True), follow=True)
    )

    def start_requests(self):
        return [scrapy.Request("https://projecteuler.net/archives", self.parse)]

    def parse_problems(self, response):
        l = ItemLoader(item=Problem(), response=response)
        l.add_css("title", "h2")
        l.add_css("id", "#problem_info")
        l.add_css("content", ".problem_content")

        yield l.load_item()

    # def parse_content(self, response):
    #     # return response.css("div.problem_content::text").extract()
    #     next_page = "https://projecteuler.net/archives;page=2"
    #     n = 3
    #
    #     while n < 14:
    #         next_page = response.urljoin(next_page)
    #         yield scrapy.Request(next_page, callback=self.parse)
    #         next_page = next_page[0:len(next_page) - 1] + str(n)
    #         n += 1
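For comparison, a rules-only CrawlSpider, with no custom start_requests and with start_urls pointing straight at the archives page, might look like the untested sketch below; the selectors and the Problem item are the ones from this post, while the class and spider names are just placeholders:

from eulerscraper.items import Problem
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule


class EulerCrawlSpider(CrawlSpider):
    name = 'euler_crawl'
    allowed_domains = ['projecteuler.net']
    # start from the archives index so the pagination rule has links to follow
    start_urls = ['https://projecteuler.net/archives']

    rules = (
        # problem pages: scrape them
        Rule(LinkExtractor(allow=(r'problem=\d+',)), callback='parse_problems'),
        # archive pagination: just follow
        Rule(LinkExtractor(allow=(r'archives;page=\d+',)), follow=True),
    )

    def parse_problems(self, response):
        loader = ItemLoader(item=Problem(), response=response)
        loader.add_css('title', 'h2')
        loader.add_css('id', '#problem_info')
        loader.add_css('content', '.problem_content')
        yield loader.load_item()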

Now I will try a combination of LinkExtractor and manual requests. In the meantime, I'll be waiting for your solutions...

Answer

I think I have found the simplest yet fitting solution (at least for my purpose), compared with the existing code written to scrape projecteuler:

# -*- coding: utf-8 -*-
import scrapy
from eulerscraper.items import Problem
from scrapy.loader import ItemLoader


class EulerSpider(scrapy.Spider):
    name = 'euler'
    allowed_domains = ['projecteuler.net']
    start_urls = ["https://projecteuler.net/archives"]

    def parse(self, response):
        numpag = response.css("div.pagination a[href]::text").extract()
        maxpag = int(numpag[len(numpag) - 1])

        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

        for i in range(2, maxpag + 1):
            next_page = "https://projecteuler.net/archives;page=" + str(i)
            yield response.follow(next_page, self.parse_next)

    def parse_next(self, response):
        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

    def parse_problems(self, response):
        l = ItemLoader(item=Problem(), response=response)
        l.add_css("title", "h2")
        l.add_css("id", "#problem_info")
        l.add_css("content", ".problem_content")

        yield l.load_item()

From the start page (archives) I follow every link to a problem, scraping the data I need with parse_problems. Then I launch requests for the other archive pages, applying the same procedure to each list of links.
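One small simplification worth noting: response.follow (available since Scrapy 1.4) resolves relative URLs against the current page, so the manual concatenation with "https://projecteuler.net/" isn't strictly needed. A minimal sketch of parse written that way, assuming the table hrefs are relative:

def parse(self, response):
    numpag = response.css("div.pagination a[href]::text").extract()
    maxpag = int(numpag[-1])

    # response.follow accepts relative hrefs, so no manual URL building is required
    for href in response.css("table#problems_table a::attr(href)").extract():
        yield response.follow(href, self.parse_problems)

    for i in range(2, maxpag + 1):
        yield response.follow("https://projecteuler.net/archives;page=" + str(i), self.parse_next)

The Item definition, with its pre- and post-processors, is also very clean: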

import re

import scrapy
from scrapy.loader.processors import MapCompose, Compose
from w3lib.html import remove_tags


def extract_first_number(text):
    i = re.search('\d+', text)
    return int(text[i.start():i.end()])


def array_to_value(element):
    return element[0]


class Problem(scrapy.Item):
    id = scrapy.Field(
        input_processor=MapCompose(remove_tags, extract_first_number),
        output_processor=Compose(array_to_value)
    )
    title = scrapy.Field(input_processor=MapCompose(remove_tags))
    content = scrapy.Field()
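To make the processor chain concrete, here is a small standalone sketch of what happens to a scraped id value; the HTML string is just an illustrative guess at what the #problem_info selector returns, and the import path assumes the helpers sit in eulerscraper/items.py as above:

from scrapy.loader.processors import Compose, MapCompose
from w3lib.html import remove_tags

from eulerscraper.items import array_to_value, extract_first_number

raw = ['<h3>Problem 7</h3>']  # hypothetical extracted value for "#problem_info"

# input processor: strip the tags, then pull the first number out of each value
input_processor = MapCompose(remove_tags, extract_first_number)
# output processor: collapse the single-element list to a plain value
output_processor = Compose(array_to_value)

values = input_processor(raw)    # -> [7]
print(output_processor(values))  # -> 7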

I launch this with the command scrapy crawl euler -o euler.json and it outputs an array of unordered JSON objects, each corresponding to a single problem. This is fine for me because I'm going to process it with JavaScript, though I think solving the ordering problem via Scrapy could be very simple.
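For illustration, one entry in euler.json comes out roughly like this (the content markup is elided; title and content stay wrapped in lists because only the id field has an output processor that collapses the list):

{
    "id": 1,
    "title": ["Multiples of 3 and 5"],
    "content": ["<div class=\"problem_content\">...</div>"]
}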

EDIT: in fact it is simple, using this pipeline:

import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.list_items = []
        self.file = open('euler.json', 'w')

    def close_spider(self, spider):
        ordered_list = [None for i in range(len(self.list_items))]

        for i in self.list_items:
            ordered_list[int(i['id'] - 1)] = json.dumps(dict(i))

        # join with commas so the file is valid JSON (no trailing comma before the bracket)
        self.file.write("[\n")
        self.file.write(",\n".join(ordered_list))
        self.file.write("\n]\n")
        self.file.close()

    def process_item(self, item, spider):
        self.list_items.append(item)
        return item
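A pipeline only runs if it's enabled in the project settings; assuming the class above lives in eulerscraper/pipelines.py, that means adding something like this to settings.py:

ITEM_PIPELINES = {
    'eulerscraper.pipelines.JsonWriterPipeline': 300,
}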

though the best solution may be to create a custom exporter:

from scrapy.exporters import JsonItemExporter
from scrapy.utils.python import to_bytes


class OrderedJsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        # To initialize the object we use JsonItemExporter's constructor
        super().__init__(file)
        self.list_items = []

    def export_item(self, item):
        self.list_items.append(item)

    def finish_exporting(self):
        ordered_list = [None for i in range(len(self.list_items))]

        for i in self.list_items:
            ordered_list[int(i['id'] - 1)] = i

        for i in ordered_list:
            if self.first_item:
                self.first_item = False
            else:
                self.file.write(b',')
                self._beautify_newline()

            itemdict = dict(self._get_serialized_fields(i))
            data = self.encoder.encode(itemdict)
            self.file.write(to_bytes(data, self.encoding))

        self._beautify_newline()
        self.file.write(b"]")

and configure it in settings.py so it is used for the json feed format:

FEED_EXPORTERS = {
    'json': 'eulerscraper.exporters.OrderedJsonItemExporter',
}
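With this setting, scrapy crawl euler -o euler.json now goes through OrderedJsonItemExporter (assuming the class really lives in eulerscraper/exporters.py), so the exported array comes out sorted by problem id without needing a separate pipeline.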