Scraping a web page and reformatting the result as a calendar file

2024/11/16 16:20:12

I'm trying to scrape this site: http://stats.swehockey.se/ScheduleAndResults/Schedule/3940

And I've gotten as far (thanks to alecxe) as retrieving the date and teams.

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaItem(Item):
    date = Field()
    teams = Field()


class SchemaSpider(BaseSpider):
    name = "schema"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["stats.swehockey.se"]
    start_urls = ["http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()
            yield item

So, my next step is to filter out anything that isn't a home game of "AIK" or "Djurgårdens IF". After that I'll need to reformat the result into an .ics file that I can add to Google Calendar.

EDIT: I've solved a few things but still have a lot to do. My code now looks like this:

# -*- coding: UTF-8 -*-
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaItem(Item):
    date = Field()
    teams = Field()


class SchemaSpider(BaseSpider):
    name = "schema"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["stats.swehockey.se"]
    start_urls = ["http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            for string in item['teams']:
                teams = string.split('-')  # split it
                home_team = teams[0]  #.split(' ') #only the first name, e.g. just 'Djurgårdens' out of 'Djurgårdens IF'
                away_team = teams[1]
                #home_team[0] = home_team[0].replace(" ", "") #remove whitespace
                #home_team = home_team[0]

                if "AIK" in home_team:
                    for string in item['date']:
                        year = string[0:4]
                        month = string[5:7]
                        day = string[8:10]
                        hour = string[11:13]
                        minute = string[14:16]
                        print year, month, day, hour, minute, home_team, away_team
                elif u"Djurgårdens" in home_team:
                    for string in item['date']:
                        year = string[0:4]
                        month = string[5:7]
                        day = string[8:10]
                        hour = string[11:13]
                        minute = string[14:16]
                        print year, month, day, hour, minute, home_team, away_team

That code prints out the games of "AIK", "Djurgårdens IF" and "Skellefteå AIK". So my problem here is obviously how to filter out the "Skellefteå AIK" games, and whether there is an easy way to make this program better. Thoughts on this?

Best regards!

Answer

I'm just guessing that home games are the ones with the team you're looking for first (before the dash).

You can do this in XPath or from Python. If you want to do it in XPath, select only the rows which contain the home team name.

//table[@class="tblContent"]/tr[
    contains(substring-before(.//td[3]/text(), "-"), "AIK")
    or
    contains(substring-before(.//td[3]/text(), "-"), "Djurgårdens IF")
]

You can safely remove all whitespace (including newlines); I just added it for readability.

For Python you should be able to do much the same, maybe even more concisely using some regular expressions.
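Here is a rough sketch of what the Python side could look like, written for Python 3 and the standard library only, so it is separate from the spider. It rests on a few assumptions that are not confirmed by the page itself: that the teams cell reads like "Home - Away" with spaces around the dash, that the date text starts with "YYYY-MM-DD HH:MM" (as the slicing in the question implies), and that a game can be blocked out as three hours. The function names, the `example.invalid` UID domain and the sample rows are all made up for illustration. Comparing the whole home-team name against a set, rather than using a substring test like `"AIK" in home_team`, is what keeps "Skellefteå AIK" out.

```python
import re
from datetime import datetime, timedelta, timezone

# Exact home-team names to keep; "Skellefteå AIK" is not in this set.
HOME_TEAMS = {"AIK", "Djurgårdens IF"}


def split_teams(text):
    """Split 'Home - Away' into (home, away); None if the format differs.

    Assumes whitespace around the dash, so hyphenated names stay intact.
    """
    parts = re.split(r"\s+-\s+", text.strip(), maxsplit=1)
    return tuple(parts) if len(parts) == 2 else None


def is_home_game(text):
    """True when the team before the dash is exactly one of HOME_TEAMS."""
    teams = split_teams(text)
    return teams is not None and teams[0] in HOME_TEAMS


def parse_start(text):
    """Parse text starting with 'YYYY-MM-DD HH:MM' (assumed format)."""
    text = text.strip()
    return datetime.strptime(text[0:10] + " " + text[11:16], "%Y-%m-%d %H:%M")


def to_ics_event(start, home, away, hours=3):
    """Build one VEVENT; the 3-hour default duration is an assumption."""
    fmt = "%Y%m%dT%H%M%S"
    end = start + timedelta(hours=hours)
    stamp = datetime.now(timezone.utc).strftime(fmt) + "Z"
    # Hypothetical UID scheme: start time plus home team name.
    uid = "%s-%s@example.invalid" % (start.strftime(fmt), home.replace(" ", ""))
    return "\r\n".join([
        "BEGIN:VEVENT",
        "UID:" + uid,
        "DTSTAMP:" + stamp,
        "DTSTART:" + start.strftime(fmt),  # floating local time, no time zone
        "DTEND:" + end.strftime(fmt),
        "SUMMARY:%s - %s" % (home, away),
        "END:VEVENT",
    ])


def to_ics(events):
    """Wrap VEVENT strings in a minimal VCALENDAR."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//schedule//EN"]
    lines.extend(events)
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Made-up rows in the assumed format, not real schedule data.
    rows = [
        ("2013-09-14 16:00", "AIK - Luleå HF"),
        ("2013-09-17 19:00", "Skellefteå AIK - AIK"),
    ]
    events = []
    for date_text, teams_text in rows:
        if is_home_game(teams_text):
            home, away = split_teams(teams_text)
            events.append(to_ics_event(parse_start(date_text), home, away))
    print(to_ics(events))
```

In the spider you would call `is_home_game` on each string in `item['teams']` and only build an event for rows that pass. The times are written without a time zone, so the calendar application decides how to interpret them; if that turns out to be a problem, the start and end would need explicit time zone handling.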
