Web Scrape page with multiple sections

2024/10/16 1:22:14

I'm pretty new to Python and I'm trying my hand at my first project.

I've been able to replicate a few simple demos, but I think there are a few extra complexities in what I'm trying to do.

I'm trying to scrape the game logs from the NHL website.

Here is what I came up with. Similar code works for the top section of the page (e.g. getting the age), but it fails on the section with display logic (what is shown depends on whether the user clicks Career, Game Logs or Splits).

Thanks in advance for your help.

import urllib2
from bs4 import BeautifulSoup

url = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
Test = soup.find_all('div', attrs={'id': "gamelogsTable"})

Answer

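This happens with many web pages. Some of the content is fetched by JavaScript code that arrives with the initial download and then runs in the browser. By doing this, designers are able to show visitors the most important parts of a page without waiting for the entire page to download. A plain HTTP fetch such as urlopen never runs that JavaScript, so the content it would have loaded is missing from what BeautifulSoup sees.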

When you want to scrape a page, the first thing you should do is examine its source code (often with Ctrl-U in a browser on Windows) to see whether the content you require is available. If it is not, you will need something beyond BeautifulSoup, for example a real browser driven by Selenium.
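The same check can be done from Python instead of the browser. The following is a minimal sketch that uses only the standard library's html.parser. The names StaticContentCheck and check_static_html are made up for illustration, the sample strings are placeholders rather than real NHL markup, and counting td cells after the container is an assumed heuristic for "the table actually has data".

```python
from html.parser import HTMLParser


class StaticContentCheck(HTMLParser):
    """Records whether an element id is present and how many <td> tags follow it."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.id_present = False
        self.cells_after_id = 0

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.element_id:
            self.id_present = True
        elif self.id_present and tag == "td":
            # Heuristic (assumption): table cells after the container mean data is there.
            self.cells_after_id += 1


def check_static_html(html_text, element_id):
    """Return (id_present, cells_after_id) for a string of HTML."""
    parser = StaticContentCheck(element_id)
    parser.feed(html_text)
    parser.close()
    return parser.id_present, parser.cells_after_id


# Placeholder HTML standing in for what a plain fetch vs. a browser might return.
static_html = '<div id="gamelogsTable"></div>'
rendered_html = ('<div id="gamelogsTable"><table><tr>'
                 '<td>a</td><td>b</td></tr></table></div>')

print(check_static_html(static_html, "gamelogsTable"))
print(check_static_html(rendered_html, "gamelogsTable"))
```

To try it on a real page, html_text could come from urllib.request.urlopen(url).read().decode('utf-8', 'replace'). If the id is missing, or present with no cells after it, the data is probably being filled in by JavaScript.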

>>> getzlafURL = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
>>> import requests
>>> import selenium.webdriver as webdriver
>>> import lxml.html as html
>>> import lxml.html.clean as clean
>>> browser = webdriver.Chrome()
>>> browser.get(getzlafURL)
>>> content = browser.page_source
>>> cleaner = clean.Cleaner()
>>> content = cleaner.clean_html(content)
>>> doc = html.fromstring(content)
>>> type(doc)
<class 'lxml.html.HtmlElement'>
>>> open('c:/scratch/temp.htm', 'w').write(content)
775838

By searching within the file temp.htm for the heading 'Ryan Getzlaf Game Logs' I was able to find this section of HTML code. As you can see, it's about what you expected to find in the original downloaded HTML. However, this additional step is required to get at it.

              </div></li></ul><h5 class="statistics__subheading">Ryan Getzlaf Game Logs</h5><div id="gamelogsTable"><div class="responsive-datatable">
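Once the rendered HTML is saved, the rows can be pulled out of that div. Here is a rough sketch using only the standard library. It assumes the game logs sit in the first table after the gamelogsTable div, which the snippet above does not confirm. TableRowExtractor and extract_rows are hypothetical names, and the sample rows are invented placeholders, not real game data.

```python
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collects cell text from the first table after the element with container_id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.in_container = False  # set once the container's start tag is seen
        self.done = False          # set at the first </table> after the container
        self.in_cell = False
        self.rows = []
        self._row = None
        self._cell = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if dict(attrs).get("id") == self.container_id:
            self.in_container = True
        elif self.in_container and tag == "tr":
            self._row = []
        elif self.in_container and tag in ("td", "th") and self._row is not None:
            self.in_cell = True
            self._cell = []

    def handle_data(self, data):
        if self.in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self.done or not self.in_container:
            return
        if tag in ("td", "th") and self.in_cell:
            self._row.append("".join(self._cell).strip())
            self.in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self.in_cell = False
        elif tag == "table":
            self.done = True  # assumption: only the first table is wanted


def extract_rows(html_text, container_id):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRowExtractor(container_id)
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Invented placeholder markup shaped like the snippet above.
sample = (
    '<h5 class="statistics__subheading">Ryan Getzlaf Game Logs</h5>'
    '<div id="gamelogsTable"><div class="responsive-datatable"><table>'
    '<tr><th>Date</th><th>G</th><th>A</th></tr>'
    '<tr><td>Game 1</td><td>0</td><td>1</td></tr>'
    '</table></div></div>'
    '<table><tr><td>unrelated</td></tr></table>'
)

for row in extract_rows(sample, "gamelogsTable"):
    print(row)
```

For the real page, html_text would be the contents of the saved temp.htm file (or browser.page_source directly). If the actual markup nests several tables inside the div, the stopping rule would need adjusting.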

I should mention that there are alternative ways of accessing such code, one of them being dryscrape. I simply can't be bothered installing that one on this Windows machine.
