On the website https://sray.arabesque.com/dashboard there is a search box (an "input" element in the HTML). I want to enter a company name in the search box, choose the first suggestion for that name in the dropdown menu (e.g., "Anglo American plc"), go to the URL with the info about that company, let the JavaScript run to get the fully rendered HTML of that page, and then scrape it for the GC Score, ESG Score, and Temperature Score at the bottom.
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.add_argument('-headless')
options.add_argument('-no-sandbox')
options.add_argument('-disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',options=options)

companies = ['Anglo American plc']
for company in companies:
    # dryscrape.start_xvfb()
    # session = dryscrape.Session()
    # session.visit("https://srayapi.arabesque.com/api/sray/company/history/004BTP-E")
    resp = wd.get('https://sray.arabesque.com/dashboard/')
    #print(driver.page_source)
    e = wd.find_element_by_id(id_='mat-input-0')
    e.send_keys(company)
    e.send_keys(Keys.ENTER)
    innerHTML = e.execute_script("return document.body.innerHTML")
    print(innerHTML)
I don't quite understand how to visit the URL with the info about Anglo American and scrape it when we don't know that URL in advance, only the company name entered in the search box.
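You can do that with Selenium, which clicks through the search suggestion the way a user would, so you don't need to know the company URL in advance. A couple of things need updating:
When running headless, you need to provide a window size.
Use WebDriverWait() to avoid synchronization issues.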
Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('-headless')
options.add_argument('-no-sandbox')
options.add_argument('-disable-dev-shm-usage')
options.add_argument('window-size=1920,1080')
wd = webdriver.Chrome(options=options)

companies = ['Anglo American plc']
for company in companies:
    wd.get('https://sray.arabesque.com/dashboard/')
    # Switch to the list view, then type the company name into the search box.
    WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='list']"))).click()
    WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@id='mat-input-0']"))).send_keys(company)
    # Click the suggestion matching the company (built from the loop variable instead of a hard-coded name).
    WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, f"//span[contains(.,' {company} ')]"))).click()
    WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "(//span[normalize-space(.)='Open dashboard'])[1]"))).click()
    # Wait for the score tabs to render before reading them.
    WebDriverWait(wd, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.mat-tab-labels")))
    # find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium releases no longer provide.
    print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'GC Score')]/span").text)
    print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'ESG Score')]/span").text)
    print(wd.find_element(By.XPATH, "//div[@class='mat-tab-label-content'][contains(.,'Temp')]/span").text)
Output:
57.03
53.78
2.7°C
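
The values come back as strings (the temperature even carries its unit), so if you want numbers you will probably need one more step. Here is a minimal sketch of that; parse_score is a hypothetical helper name, not part of Selenium, and the sample strings are simply copied from the output above rather than scraped live:

```python
import re


def parse_score(text):
    """Return the first number found in a scraped label such as '57.03' or '2.7°C'."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no numeric value found in {text!r}")
    return float(match.group())


# Sample strings copied from the output above (assumed format, not fetched live).
scraped = {
    "GC Score": "57.03",
    "ESG Score": "53.78",
    "Temperature Score": "2.7°C",
}

# Convert every scraped label to a float, dropping any trailing unit.
scores = {name: parse_score(value) for name, value in scraped.items()}
print(scores)
```

In the loop you would pass each element's .text to parse_score instead of printing it directly. The regular expression assumes the site keeps using a dot as the decimal separator.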