Can't get JavaScript-generated HTML using Python

2024/9/22 7:29:10

I'm trying to write a Python script that automatically extracts the content of a table on a webpage. I got it working on pure HTML pages, but one website is giving me a headache: its HTML seems to be generated by JavaScript. I tried the dryscrape, selenium and PyQt4 libraries, following examples from several posts, but without success. Every time, I only get the HTML as it was before the JavaScript did its job, so without the table. I can see the table in the browser and when I "Inspect" the HTML in Chrome, but when I use "View Page Source" in Chrome the table is not there either, which may be a hint.

The website is the following:

https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231
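
A note on this URL: everything after the `#` is a fragment, and HTTP clients do not send the fragment to the server. The small sketch below (standard library only, Python 3) splits the URL to show what a plain HTTP fetch would actually request. That the remaining part is consumed by the page's own JavaScript is an assumption based on the symptoms described, not something confirmed by the site.

```python
from urllib.parse import urldefrag

url = "https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231"

# urldefrag separates the part that goes over the wire from the fragment.
requested, fragment = urldefrag(url)

print(requested)  # what the server is asked for: the same page for every security
print(fragment)   # stays in the client; presumably read by the page's JavaScript
```

If that assumption holds, any tool that only downloads the response body sees the same table-less page regardless of which security is named after the `#`.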

Here is the code I tried (if you check, the output contains no table tags):

Using urllib2:

import urllib2
url="https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231"
html = urllib2.urlopen(url)
print html.read()  # read() returns the page body; printing the response object only shows its repr

Using dryscrape:

import dryscrape 
session = dryscrape.Session()
session.visit(url) 
response = session.body()
print response

Using selenium:

from selenium import webdriver
driver = webdriver.Chrome("/usr/lib/chromium/chromedriver")
driver.get(url)
print driver.page_source #page_source fetches page after rendering is complete
driver.quit()

Using PyQt4:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

# This does the magic. Loads everything
r = Render(url)  
#result is a QString.
result = r.frame.toHtml()
#QString should be converted to string before processed by lxml
formatted_result = str(result.toAscii())
print formatted_result

I would really appreciate it if somebody could help me with this :-)

Cheers

Answer

Use an implicit wait (an explicit wait also works) so that Selenium keeps retrying the element lookup until the table has been rendered, instead of failing on the first attempt:

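from selenium import webdriver
# Note: PhantomJS and find_element_by_tag_name exist only in older Selenium releases (both were removed in Selenium 4.x)
driver = webdriver.PhantomJS()
url = "https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231"
driver.get(url)
# Element lookups now retry for up to 30 seconds before giving up
driver.implicitly_wait(30)
print(driver.find_element_by_tag_name("table").text)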

This is the output I am getting:

Titel/Titres/Titoli W Nominell Valoren-Nr. Steuerwert Ertrag /Rendement / Reddito 2016 M Valeur No de Val. imposable Datum / DateCp. W Brutto KG/KEP zu versteuernder V nominale valeur Val. imposibleData M Brut Ertrag/Rendement Valore Numero di 31.12.2016 ex. zahlb. Vlordo imposable/Reddito nominale valore pay. imponible CHF (E) pag.Fr.W. CHF CHF iShares ETF (CH) - iShares SMI (R) (CH), Schweiz
CHF 0.00 889 976 85.31 25.02. 29.02. 36 CHF 0.48
03.03. 07.03. 37 CHF 0.48
11.04. 13.04. 38 CHF 0.70
19.07. 21.07. 40 CHF 0.88
19.07. 21.07. 39 CHF 0.20
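
For the explicit-wait alternative mentioned above, the underlying idea is a polling loop: check a condition repeatedly until it holds or a timeout expires. The sketch below is a hand-rolled illustration of that mechanism in plain Python, not Selenium's implementation. `wait_until` and `FakePage` are hypothetical names introduced for this example, and `FakePage` merely simulates a table that shows up on the third check.

```python
import time


def wait_until(condition, timeout=30.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


class FakePage:
    """Hypothetical stand-in for a page whose table appears on the third poll."""

    def __init__(self):
        self.polls = 0

    def find_tables(self):
        self.polls += 1
        return ["<table>"] if self.polls >= 3 else []


page = FakePage()
tables = wait_until(page.find_tables, timeout=5.0, poll=0.01)
print(tables, page.polls)
```

With a real driver, the condition could be something like `lambda: driver.find_elements(By.TAG_NAME, "table")`. In practice Selenium's own `WebDriverWait(driver, 30).until(...)` from `selenium.webdriver.support.ui` provides this behaviour and is usually the better choice than a custom loop.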
