I'm trying to write a Python script that automatically extracts the content of a table on a webpage. I got it working on plain HTML pages, but one website is giving me a headache: its HTML seems to be generated by JavaScript. I tried the dryscrape, selenium and PyQt4 libraries, following examples from several posts, but without success. I always get the HTML as it is before the JavaScript has run, so without the tables. I can see the table in the browser and when I "Inspect" the HTML in Chrome. When I use "View Page Source" in Chrome the table is not there either; maybe that gives a hint.
The website is the following:
https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231
Here is the code I tried (none of the output contains any table tags):
Using urllib2:
import urllib2
url="https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231"
html = urllib2.urlopen(url)
print html.read()  # urlopen returns a response object; read() returns the HTML text
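One detail about this URL: everything after `#` is a fragment. A browser keeps the fragment for the page's own JavaScript and does not send it to the server, so a plain HTTP client can only ever receive the bare `en.html` shell. The following is a minimal sketch of that split, written for the Python 3 standard library rather than the Python 2 used in the rest of this post; the variable names `requested` and `fragment` are only illustrative.

```python
from urllib.parse import urlsplit

url = "https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231"
parts = urlsplit(url)

# The part an HTTP client actually requests from the server
requested = "{}://{}{}".format(parts.scheme, parts.netloc, parts.path)

# The part that stays in the browser for the page's JavaScript to interpret
fragment = parts.fragment

print(requested)
print(fragment)
```

This would be consistent with "View Page Source" showing no table: the server response is the same shell whatever follows the `#`.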
Using dryscrape:
import dryscrape
session = dryscrape.Session()
session.visit(url)
response = session.body()
print response
Using selenium:
from selenium import webdriver
driver = webdriver.Chrome("/usr/lib/chromium/chromedriver")
driver.get(url)
print driver.page_source #page_source fetches page after rendering is complete
driver.quit()
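A possible reason for the selenium attempt returning no table is that `page_source` is read as soon as `get()` returns, which may be before the page's JavaScript has inserted the table. A common remedy is an explicit wait. The sketch below is untested against this site and rests on several assumptions: Python 3, a selenium version that can locate chromedriver by itself, and a rendered table that is a real `<table>` element. The names `fetch_rendered_html`, `TableRows` and `table_rows` are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, timeout=20):
    """Load url in Chrome and return the HTML once a <table> is in the DOM."""
    # Imported lazily so the parsing helper below works without selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()  # assumption: chromedriver is discoverable
    try:
        driver.get(url)
        # Block until at least one <table> exists; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    finally:
        driver.quit()


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return all table rows found in html as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    return parser.rows
```

Usage would be along the lines of `rows = table_rows(fetch_rendered_html(url))`. The parser is deliberately simple: it merges the rows of all tables on the page and does not handle nested tables, so a real script would probably select the specific table first.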
Using PyQt4:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

# This does the magic: loads everything
r = Render(url)
#result is a QString.
result = r.frame.toHtml()
#QString should be converted to string before processed by lxml
formatted_result = str(result.toAscii())
print formatted_result
I would really appreciate it if somebody could help me with this :-)
Cheers