I have a bs4 scraper that works together with Selenium (see script2 far below). It works fine so far: it fetches some data from the page clutch.co/il/it-services.
To enrich the scraped data with additional information, I tried to modify the scraping logic to extract more details for each company. Here is the updated version of the code, which is meant to extract the company's website and additional information:
Here we have script1:
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_info = soup.select(".directory-list div.provider-info")

data_list = []
for info in company_info:
    company_name = info.select_one(".company_info a").get_text(strip=True)
    location = info.select_one(".locality").get_text(strip=True)
    website = info.select_one(".company_info a")["href"]

    # Additional information you want to extract goes here
    # For example, you can extract the description
    description = info.select_one(".description").get_text(strip=True)

    data_list.append({
        "Company Name": company_name,
        "Location": location,
        "Website": website,
        "Description": description
    })

df = pd.DataFrame(data_list)
df.index += 1

print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data_enriched.csv", index=False)

driver.quit()
Ideas behind this extended version: in this code I added a loop that goes through each company's information, extracts the website, and adds a placeholder for additional information (in this case, the description). I thought that I could adapt this loop to extract more data as needed. At least that is the idea.
The working model: the HTML structure of the page can of course change, so I probably need to adjust the CSS selectors to match the current structure of the page and customize the scraping logic for the specific details I want to extract for each company. Conclusion: I think I am very close, but see what I got back:
/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/bin/python /home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py
/home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py:2: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

  import pandas as pd
Traceback (most recent call last):
  File "/home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py", line 29, in <module>
    description = info.select_one(".description").get_text(strip=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get_text'

Process finished with exit code
And now, see below my already working model: my approach to fetch some data from the given page clutch.co/il/it-services.
Here we have script2:
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_names = soup.select(".directory-list div.provider-info--header .company_info a")
locations = soup.select(".locality")

company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]

data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.index += 1
print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data.csv", index=False)

driver.quit()
+----+-----------------------------------------------------+--------------------------------+
| | Company Name | Location |
|----+-----------------------------------------------------+--------------------------------|
| 1 | Artelogic | L'viv, Ukraine |
| 2 | Iron Forge Development | Palm Beach Gardens, FL |
| 3 | Lionwood.software | L'viv, Ukraine |
| 4 | Greelow | Tel Aviv-Yafo, Israel |
| 5 | Ester Digital | Tel Aviv-Yafo, Israel |
| 6 | Nextly | Vitória, Brazil |
| 7 | Rootstack | Austin, TX |
| 8 | Novo | Dallas, TX |
| 9 | Scalo | Tel Aviv-Yafo, Israel |
| 10 | TLVTech | Herzliya, Israel |
| 11 | Dofinity | Bnei Brak, Israel |
| 12 | PURPLE | Petah Tikva, Israel |
| 13 | Insitu S2 Tikshuv LTD | Haifa, Israel |
| 14 | Opinov8 Technology Services | London, United Kingdom |
| 15 | Sogo Services | Tel Aviv-Yafo, Israel |
| 16 | Naviteq LTD | Tel Aviv-Yafo, Israel |
| 17 | BMT - Business Marketing Tools | Ra'anana, Israel |
| 18 | Profisea | Hod Hasharon, Israel |
| 19 | MeteorOps | Tel Aviv-Yafo, Israel |
| 20 | Trivium Solutions | Herzliya, Israel |
| 21 | Dynomind.tech | Jerusalem, Israel |
| 22 | Madeira Data Solutions | Kefar Sava, Israel |
| 23 | Titanium Blockchain | Tel Aviv-Yafo, Israel |
| 24 | Octopus Computer Solutions | Tel Aviv-Yafo, Israel |
| 25 | Reblaze | Tel Aviv-Yafo, Israel |
| 26 | ELPC Networks Ltd | Rosh Haayin, Israel |
| 27 | Taldor | Holon, Israel |
| 28 | Clarity | Petah Tikva, Israel |
| 29 | Opsfleet | Kfar Bin Nun, Israel |
| 30 | Hozek Technologies Ltd. | Petah Tikva, Israel |
| 31 | ERG Solutions | Ramat Gan, Israel |
| 32 | Komodo Consulting | Ra'anana, Israel |
| 33 | SCADAfence | Ramat Gan, Israel |
| 34 | Ness Technologies | נס טכנולוגיות | Tel Aviv-Yafo, Israel |
| 35 | Bynet Data Communications Bynet Data Communications | Tel Aviv-Yafo, Israel |
| 36 | Radware | Tel Aviv-Yafo, Israel |
| 37 | BigData Boutique | Rishon LeTsiyon, Israel |
| 38 | NetNUt | Tel Aviv-Yafo, Israel |
| 39 | Asperii | Petah Tikva, Israel |
| 40 | PractiProject | Ramat Gan, Israel |
| 41 | K8Support | Bnei Brak, Israel |
| 42 | Odix | Rosh Haayin, Israel |
| 43 | Panaya | Hod Hasharon, Israel |
| 44 | MazeBolt Technologies | Giv'atayim, Israel |
| 45 | Porat | Tel Aviv-Jaffa, Israel |
| 46 | MindU | Tel Aviv-Yafo, Israel |
| 47 | Valinor Ltd. | Petah Tikva, Israel |
| 48 | entrypoint | Modi'in-Maccabim-Re'ut, Israel |
| 49 | Adelante | Tel Aviv-Yafo, Israel |
| 50 | Code n' Roll | Haifa, Israel |
| 51 | Linnovate | Bnei Brak, Israel |
| 52 | Viceman Agency | Tel Aviv-Jaffa, Israel |
| 53 | develeap | Tel Aviv-Yafo, Israel |
| 54 | Chalir.com | Binyamina-Giv'at Ada, Israel |
| 55 | WolfCode | Rishon LeTsiyon, Israel |
| 56 | Penguin Strategies | Ra'anana, Israel |
| 57 | ANG Solutions | Tel Aviv-Yafo, Israel |
+----+-----------------------------------------------------+--------------------------------+
What I am aiming at: I want to fetch some more data from the given page clutch.co/il/it-services, e.g. the website and so on.
Update: The error AttributeError: 'NoneType' object has no attribute 'get_text' indicates that .select_one(".description") did not find any HTML element with the class "description" inside the current company block, so it returned None. Calling .get_text(strip=True) on None then raises the AttributeError.
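One way to keep the loop from script1 alive when a selector finds nothing is to check for None before calling .get_text(). The following is only a minimal sketch: the HTML snippet is made up and merely mimics the class names used in script1 (the real page may look different), and text_or_default is a hypothetical helper name, not part of BeautifulSoup. The "Not Available" fallback is borrowed from script3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names used in script1;
# the real clutch.co page may be structured differently.
SAMPLE_HTML = """
<div class="directory-list">
  <div class="provider-info">
    <h3 class="company_info"><a href="/profile/acme">Acme</a></h3>
    <span class="locality">Haifa, Israel</span>
    <p class="description">Managed IT services.</p>
  </div>
  <div class="provider-info">
    <h3 class="company_info"><a href="/profile/no-desc">NoDesc Ltd</a></h3>
    <span class="locality">Tel Aviv-Yafo, Israel</span>
  </div>
</div>
"""


def text_or_default(parent, selector, default="Not Available"):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    node = parent.select_one(selector)  # None when the selector finds nothing
    return node.get_text(strip=True) if node is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for info in soup.select(".directory-list div.provider-info"):
    rows.append({
        "Company Name": text_or_default(info, ".company_info a"),
        "Location": text_or_default(info, ".locality"),
        "Description": text_or_default(info, ".description"),
    })

for row in rows:
    print(row)
```

With a guard like this, a company block without a description produces a placeholder value instead of stopping the whole run. If the placeholder shows up for every company, that would suggest the selector itself no longer matches the page.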
More to follow later today.
Update 2, note: @jakob had an interesting idea, posted here: Selenium in Google Colab without having to worry about managing the ChromeDriver executable - i tried an example using kora.selenium. Jakob writes: "I made Google-Colab-Selenium to solve this problem. It manages the executable and the required Selenium Options for you." That sounds very interesting. At the moment I cannot quite imagine getting Selenium to work on Colab in such a way that the scraper mentioned above runs there completely. Any ideas would be awesome:
Jakob: the real issue is that the website you are trying to scrape is using CloudFlare, which can detect selenium. I wrote a little code to scrape the data that you were looking for. You actually don't need to use Selenium as the data is already baked right into the HTML when you go to the webpage.
https://colab.research.google.com/drive/1qkZ1OV_Nqeg13UY3S9pY0IXuB4-q3Mvx?usp=sharing
here we have script3
%pip install -q curl_cffi
%pip install -q fake-useragent
%pip install -q lxml

from curl_cffi import requests
from fake_useragent import UserAgent
# we need to take care for this: https://pypi.org/project/fake-useragent/

ua = UserAgent()
headers = {'User-Agent': ua.safari}
resp = requests.get('https://clutch.co/il/it-services', headers=headers, impersonate="safari15_3")
resp.status_code

# I like to use this to verify the contents of the request
from IPython.display import HTML

HTML(resp.text)

from lxml.html import fromstring

tree = fromstring(resp.text)

data = []

for company in tree.xpath('//ul/li[starts-with(@id, "provider")]'):
    data.append({
        "name": company.xpath('./@data-title')[0].strip(),
        "location": company.xpath('.//span[@class = "locality"]')[0].text,
        "wage": company.xpath('.//div[@data-content = "<i>Avg. hourly rate</i>"]/span/text()')[0].strip(),
        "min_project_size": company.xpath('.//div[@data-content = "<i>Min. project size</i>"]/span/text()')[0].strip(),
        "employees": company.xpath('.//div[@data-content = "<i>Employees</i>"]/span/text()')[0].strip(),
        "description": company.xpath('.//blockquote//p')[0].text,
        "website_link": (company.xpath('.//a[contains(@class, "website-link__item")]/@href') or ['Not Available'])[0],
    })

import pandas as pd
from pandas import json_normalize
df = json_normalize(data, max_level=0)
df
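A side note on script3: only website_link uses the `or ['Not Available']` fallback, while the other fields index `[0]` directly, so a company with a missing field would presumably raise an IndexError. The same fallback idea can be wrapped in a small helper. This is just a sketch: first_or_default is a hypothetical name, and plain Python lists stand in for what company.xpath(...) returns.

```python
def first_or_default(matches, default="Not Available"):
    """Return the first XPath match (stripped if it is a string), else `default`."""
    if not matches:  # xpath() returns an empty list when nothing matches
        return default
    value = matches[0]
    # Attribute and text() results behave like strings; elements are returned as-is.
    return value.strip() if isinstance(value, str) else value


# Plain lists stand in for the results of company.xpath(...) here.
found = ["  $50 - $99 / hr "]
missing = []

print(first_or_default(found))
print(first_or_default(missing))
```

In script3 this could be used as, for example, first_or_default(company.xpath('./@data-title')). For results that are elements rather than strings (like the blockquote paragraph), the .text attribute would still have to be read separately.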
That said, I think I understand the approach: fetching the HTML and then working with XPath. The part I have difficulties with is the User-Agent part.
It works awesome, it is just overwhelming!
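Regarding the User-Agent part, as far as I understand it: the User-Agent is just an HTTP request header in which the client tells the server what kind of browser it claims to be. In script3, ua.safari from fake_useragent supplies a Safari-style string for that header, and impersonate="safari15_3" additionally asks curl_cffi to make the connection itself look like that browser. The sketch below uses only the standard library and builds a request without sending it, so the header can be inspected. The User-Agent string is written out by hand and is only an illustrative assumption, not the exact value fake_useragent would return.

```python
from urllib.request import Request

# Hypothetical Safari-like User-Agent string, written out by hand for illustration;
# fake_useragent's ua.safari returns a string of this general shape.
SAFARI_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) "
    "Version/15.3 Safari/605.1.15"
)

headers = {"User-Agent": SAFARI_UA}

# Build the request without sending it, just to inspect what would go out.
req = Request("https://clutch.co/il/it-services", headers=headers)

# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
print(req.get_full_url())
```

So the headers dictionary in script3 is nothing more than this one key/value pair. Keeping the header value and the impersonate target on the same browser family (Safari in both places) seems like the sensible choice, since a mismatch between the two could look suspicious to the server.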