Running bs4 scraper needs to be extended to enrich the dataset - some issues

2024/7/6 21:31:19

I have a bs4 scraper that works with Selenium - see script2 far below. So far it works fine: it fetches some data from the given page: clutch.co/il/it-services

To enrich the scraped data with additional information, I tried to modify the scraping logic to extract more details from each company's page. Below is an updated version of the code that extracts the company's website and additional information:

here we have script1

import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_info = soup.select(".directory-list div.provider-info")

data_list = []
for info in company_info:
    company_name = info.select_one(".company_info a").get_text(strip=True)
    location = info.select_one(".locality").get_text(strip=True)
    website = info.select_one(".company_info a")["href"]
    # Additional information you want to extract goes here
    # For example, you can extract the description
    description = info.select_one(".description").get_text(strip=True)
    data_list.append({
        "Company Name": company_name,
        "Location": location,
        "Website": website,
        "Description": description
    })

df = pd.DataFrame(data_list)
df.index += 1
print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data_enriched.csv", index=False)
driver.quit()

Ideas behind this extended version: in this code I added a loop that goes through each company's information, extracts the website, and includes a placeholder for additional information (in this case, the description). My thought was that I can adapt this loop to extract more data as needed. At least that is the idea.

The working model: the structure of the HTML has of course changed here, so I need to adapt the scraping logic - that is, adjust the CSS selectors based on the current structure of the page. So far so good. We also need to make sure to customize the scraping logic to the specific details we want to extract from each company's page. Conclusion: I think I am very close - but see what I got back:

/home/ubuntu/PycharmProjects/clutch_scraper_2/.venv/bin/python /home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py
/home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py:2: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
  import pandas as pd
Traceback (most recent call last):
  File "/home/ubuntu/PycharmProjects/clutch_scraper_2/clutch_scraper_II.py", line 29, in <module>
    description = info.select_one(".description").get_text(strip=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get_text'

Process finished with exit code

And now - see below my already working model: my approach to fetch some data from the given page: clutch.co/il/it-services

here we have script2

import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_names = soup.select(".directory-list div.provider-info--header .company_info a")
locations = soup.select(".locality")

company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]

data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.index += 1
print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data.csv", index=False)
driver.quit()


+----+-----------------------------------------------------+--------------------------------+
|    | Company Name                                        | Location                       |
|----+-----------------------------------------------------+--------------------------------|
|  1 | Artelogic                                           | L'viv, Ukraine                 |
|  2 | Iron Forge Development                              | Palm Beach Gardens, FL         |
|  3 | Lionwood.software                                   | L'viv, Ukraine                 |
|  4 | Greelow                                             | Tel Aviv-Yafo, Israel          |
|  5 | Ester Digital                                       | Tel Aviv-Yafo, Israel          |
|  6 | Nextly                                              | Vitória, Brazil                |
|  7 | Rootstack                                           | Austin, TX                     |
|  8 | Novo                                                | Dallas, TX                     |
|  9 | Scalo                                               | Tel Aviv-Yafo, Israel          |
| 10 | TLVTech                                             | Herzliya, Israel               |
| 11 | Dofinity                                            | Bnei Brak, Israel              |
| 12 | PURPLE                                              | Petah Tikva, Israel            |
| 13 | Insitu S2 Tikshuv LTD                               | Haifa, Israel                  |
| 14 | Opinov8 Technology Services                         | London, United Kingdom         |
| 15 | Sogo Services                                       | Tel Aviv-Yafo, Israel          |
| 16 | Naviteq LTD                                         | Tel Aviv-Yafo, Israel          |
| 17 | BMT - Business Marketing Tools                      | Ra'anana, Israel               |
| 18 | Profisea                                            | Hod Hasharon, Israel           |
| 19 | MeteorOps                                           | Tel Aviv-Yafo, Israel          |
| 20 | Trivium Solutions                                   | Herzliya, Israel               |
| 21 | Dynomind.tech                                       | Jerusalem, Israel              |
| 22 | Madeira Data Solutions                              | Kefar Sava, Israel             |
| 23 | Titanium Blockchain                                 | Tel Aviv-Yafo, Israel          |
| 24 | Octopus Computer Solutions                          | Tel Aviv-Yafo, Israel          |
| 25 | Reblaze                                             | Tel Aviv-Yafo, Israel          |
| 26 | ELPC Networks Ltd                                   | Rosh Haayin, Israel            |
| 27 | Taldor                                              | Holon, Israel                  |
| 28 | Clarity                                             | Petah Tikva, Israel            |
| 29 | Opsfleet                                            | Kfar Bin Nun, Israel           |
| 30 | Hozek Technologies Ltd.                             | Petah Tikva, Israel            |
| 31 | ERG Solutions                                       | Ramat Gan, Israel              |
| 32 | Komodo Consulting                                   | Ra'anana, Israel               |
| 33 | SCADAfence                                          | Ramat Gan, Israel              |
| 34 | Ness Technologies | נס טכנולוגיות                         | Tel Aviv-Yafo, Israel          |
| 35 | Bynet Data Communications Bynet Data Communications | Tel Aviv-Yafo, Israel          |
| 36 | Radware                                             | Tel Aviv-Yafo, Israel          |
| 37 | BigData Boutique                                    | Rishon LeTsiyon, Israel        |
| 38 | NetNUt                                              | Tel Aviv-Yafo, Israel          |
| 39 | Asperii                                             | Petah Tikva, Israel            |
| 40 | PractiProject                                       | Ramat Gan, Israel              |
| 41 | K8Support                                           | Bnei Brak, Israel              |
| 42 | Odix                                                | Rosh Haayin, Israel            |
| 43 | Panaya                                              | Hod Hasharon, Israel           |
| 44 | MazeBolt Technologies                               | Giv'atayim, Israel             |
| 45 | Porat                                               | Tel Aviv-Jaffa, Israel         |
| 46 | MindU                                               | Tel Aviv-Yafo, Israel          |
| 47 | Valinor Ltd.                                        | Petah Tikva, Israel            |
| 48 | entrypoint                                          | Modi'in-Maccabim-Re'ut, Israel |
| 49 | Adelante                                            | Tel Aviv-Yafo, Israel          |
| 50 | Code n' Roll                                        | Haifa, Israel                  |
| 51 | Linnovate                                           | Bnei Brak, Israel              |
| 52 | Viceman Agency                                      | Tel Aviv-Jaffa, Israel         |
| 53 | develeap                                            | Tel Aviv-Yafo, Israel          |
| 54 | Chalir.com                                          | Binyamina-Giv'at Ada, Israel   |
| 55 | WolfCode                                            | Rishon LeTsiyon, Israel        |
| 56 | Penguin Strategies                                  | Ra'anana, Israel               |
| 57 | ANG Solutions                                       | Tel Aviv-Yafo, Israel          |
+----+-----------------------------------------------------+--------------------------------+

What I am aiming for: I want to fetch some more data from the given page clutch.co/il/it-services - e.g. the website and so on...

update_: The error AttributeError: 'NoneType' object has no attribute 'get_text' indicates that the .select_one(".description") method did not find any HTML element with the class ".description" for the current company information, resulting in None. Therefore, calling .get_text(strip=True) on None raises an AttributeError.
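A defensive way around this - just a minimal sketch, assuming the rest of script1's loop stays as it is - is to wrap each lookup in a small helper that tolerates missing elements instead of raising:

# Minimal sketch, assuming script1's loop is otherwise unchanged: a helper
# that returns a fallback value when a selector matches nothing.
def safe_text(parent, selector, default="N/A"):
    el = parent.select_one(selector)
    return el.get_text(strip=True) if el is not None else default

for info in company_info:
    company_name = safe_text(info, ".company_info a")
    location = safe_text(info, ".locality")
    description = safe_text(info, ".description")  # no AttributeError if missing

This way a single company card without a description no longer aborts the whole run; you just get "N/A" in that row.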

more to follow... later today.

update2: note: @jakob had an interesting idea - posted here: Selenium in Google Colab without having to worry about managing the ChromeDriver executable. I tried an example using kora.selenium. He writes: "I made Google-Colab-Selenium to solve this problem. It manages the executable and the required Selenium Options for you." - well, that sounds very, very interesting. At the moment I cannot imagine that we can get Selenium working on Colab in such a way that the above-mentioned scraper runs fully and well!? Ideas would be awesome:
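For what it's worth, a minimal sketch of what that could look like, assuming the google-colab-selenium package behaves as its README describes (I have not verified this in Colab myself):

# Minimal sketch, assuming google-colab-selenium works as documented:
# it fetches ChromeDriver and sets the Colab-friendly Chrome options for you.
%pip install -q google-colab-selenium

import google_colab_selenium as gs

driver = gs.Chrome()  # would replace webdriver.Chrome(options=options)
driver.get("https://clutch.co/il/it-services")
html = driver.page_source
driver.quit()
# from here on, the BeautifulSoup logic of script1/script2 could be reused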

Jakob: the real issue is that the website you are trying to scrape uses Cloudflare, which can detect Selenium. I wrote a little code to scrape the data you were looking for. You actually don't need Selenium, as the data is already baked right into the HTML when you go to the webpage.

https://colab.research.google.com/drive/1qkZ1OV_Nqeg13UY3S9pY0IXuB4-q3Mvx?usp=sharing

here we have script3

%pip install -q curl_cffi
%pip install -q fake-useragent
%pip install -q lxml

from curl_cffi import requests
from fake_useragent import UserAgent

# we need to take care for this: https://pypi.org/project/fake-useragent/
ua = UserAgent()
headers = {'User-Agent': ua.safari}
resp = requests.get('https://clutch.co/il/it-services', headers=headers, impersonate="safari15_3")
resp.status_code

# I like to use this to verify the contents of the request
from IPython.display import HTML
HTML(resp.text)

from lxml.html import fromstring

tree = fromstring(resp.text)

data = []
for company in tree.xpath('//ul/li[starts-with(@id, "provider")]'):
    data.append({
        "name": company.xpath('./@data-title')[0].strip(),
        "location": company.xpath('.//span[@class = "locality"]')[0].text,
        "wage": company.xpath('.//div[@data-content = "<i>Avg. hourly rate</i>"]/span/text()')[0].strip(),
        "min_project_size": company.xpath('.//div[@data-content = "<i>Min. project size</i>"]/span/text()')[0].strip(),
        "employees": company.xpath('.//div[@data-content = "<i>Employees</i>"]/span/text()')[0].strip(),
        "description": company.xpath('.//blockquote//p')[0].text,
        "website_link": (company.xpath('.//a[contains(@class, "website-link__item")]/@href') or ['Not Available'])[0],
    })

import pandas as pd
from pandas import json_normalize

df = json_normalize(data, max_level=0)
df

That said - I think I understand the approach: fetching the HTML and then working with XPath. The thing I have difficulties with is the user-agent part.
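The user-agent part is smaller than it looks - a minimal sketch of what fake_useragent does, assuming current package behavior (each attribute access returns a randomly picked real-world UA string for that browser family):

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.safari)   # a random real-world Safari user-agent string
print(ua.chrome)   # same idea for Chrome
print(ua.random)   # any browser family

# script3 just puts one such string into the request headers, so the request
# no longer advertises itself as coming from python/curl:
headers = {'User-Agent': ua.safari}

Together with impersonate="safari15_3" in curl_cffi, the request then looks like a regular Safari browser at both the header and the TLS level.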

it works awesome - it is just overwhelming...!!!

Answer

TL;DR

Change this line:

description = info.select_one(".description").get_text(strip=True)

to this:

description = [i for i in info.find_all("div") if 'description' in ''.join(i['class'])][0].get_text(strip=True)

This will find the tags that have description in their class, regardless of whether it is a single class name or part of a longer class name.


Explanation

I'm not an expert in beautifulsoup, and I really encourage not using it if you're already dealing with Selenium (selection with Selenium is WAY EASIER if you learn XPath). Anyway, only one modification is needed for your code to work:

It's this line:

description = info.select_one(".description").get_text(strip=True)

should be like this:

description = [i for i in info.find_all("div") if 'description' in ''.join(i['class'])][0].get_text(strip=True)

Your original code had this:

info.select_one(".description")

This will try to find an element inside the info element you found. If you inspect the page, there is always a div that has this class: col-md-3 provider-info__description.

COOL! So the element is there, but bs4 didn't find it. That's because the .select_one and .select functions split an element's classes into a list and match against whole class names only.

So the class we've seen earlier would look like this to bs4: ['col-md-3', 'provider-info__description'] - and description on its own is not one of those entries.

If you want to test it yourself, try this code:

for i in company_info[0].find_all("div"):
    print(i['class'])

This will print all the classes for all div tags it will find. You'll see ['col-md-3', 'provider-info__description'] at the bottom.

I don't know why you're using .select and .select_one; I usually use .find and .find_all (they take the tag name, and you can specify classes and other attributes instead of a CSS selector).

So you could either replace every .select with .find_all, or only replace it in this one place (my solution), as sketched below.
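For completeness, a minimal sketch of what the .find variant could look like - assuming the class on the page really is provider-info__description, as inspected above:

# Two alternatives using .find instead of .select_one, assuming the
# page's class token is 'provider-info__description' as inspected above.

# (a) match the full class token directly:
desc_el = info.find("div", class_="provider-info__description")

# (b) match any div whose class contains 'description' as a substring
#     (bs4 calls the function on each class token of each candidate div):
desc_el = info.find("div", class_=lambda c: c and "description" in c)

description = desc_el.get_text(strip=True) if desc_el else "N/A"

Variant (a) is the stricter choice if you trust the class name; variant (b) survives minor renames, which is the same idea as the comprehension-based fix.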

OK, back to the solution. So, let's see the new line of code again:

description = [i for i in info.find_all("div") if 'description' in ''.join(i['class'])][0].get_text(strip=True)

This line will look for all the div tags that are inside your info element. Then, it'll only select the ones that have description inside their class.

''.join(i['class'])

NOTE: if you're confused by the syntax of this part, it is called a Python comprehension. See here.

The join part combines all the class names into one string, so we don't need to find description as a separate class - we only want to know whether description appears anywhere in there.
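To make that concrete, a quick check against the class list we inspected above:

# Why the join-based check matches the page's class list.
classes = ['col-md-3', 'provider-info__description']
print(''.join(classes))                    # 'col-md-3provider-info__description'
print('description' in ''.join(classes))   # True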

This solution should almost always work.

Hope it helps!

https://en.xdnf.cn/q/119857.html
