How to scrape multiple results having the same tags and class

2024/10/5 15:12:13

My code works correctly for a single page, but when I run it over multiple records in a for loop, any missing field throws the extraction off. I select the person, location, phone and cell values by position (indexes [1], [2], and so on), so if, say, the person name is missing, the next field's value ends up in the person variable. Could you please fix this issue? Here is my code:

import requests
from bs4 import BeautifulSoup
import re

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'lxml')  # 1. html, 2. parser
        return soup

def get_detail_data(soup):
    try:
        title = soup.find("h1", {'class': 'sc-AykKI'}).text
    except:
        title = 'Empty Title'
    try:
        person = soup.find("span", {'class': 'Contact__Item-sc-1giw2l4-2 kBpGee'}).text.strip()
    except:
        person = 'Empty Person'
    try:
        addr = soup.findAll("span", {'class': 'Contact__Item-sc-1giw2l4-2 kBpGee'})[1].text
    except:
        addr = 'Empty Address'
    #abn = soup.find('div', class_="box__Box-sc-1u3aqjl-0 kxddET").('a').text
    try:
        ratting = soup.find("div", {'class': 'Rating__RatingText-sc-1r9ytu8-1 jIdgkl'}).text
    except:
        ratting = 'Empty Ratting'
    try:
        abn = re.search('abn\\\\":\\\\"(.*?)\\\\"', soup.text).group(1)
    except:
        abn = 'Empty ABN'
    try:
        website = re.search('website\\\\":\\\\"(.*?)\\\\"', soup.text).group(1)
    except:
        website = 'Empty Website'
    try:
        phone = re.search('phone\\\\":\\\\"(.*?)\\\\"', soup.text).group(1)
    except:
        phone = 'Empty Phone No'
    try:
        cell = re.search('mobile\\\\":\\\\"(.*?)\\\\"', soup.text).group(1)
    except:
        cell = 'Empty Cell No'
    data = {
        'title'      : title,
        'peron name' : person,
        'address'    : addr,
        'phone no'   : phone,
        'cell no'    : cell,
        'abn no'     : abn,
        'website'    : website,
    }
    return data
def get_index_data(soup):
    titles = []
    for item in soup.findAll("h3", {'class': 'sc-bZQynM sc-iwsKbI dpKmnV'}):
        urls = f"https://hipages.com.au{item.previous_element.get('href')}"
        titles.append(urls)
    return titles

def Main():
    url = "https://hipages.com.au/connect/abcelectricservicespl/service/126298"
    mainurl = "https://hipages.com.au/find/antenna_services/nsw/sydney"
    main_titles = get_index_data(get_page(mainurl))
    for title in main_titles:
        data1 = get_detail_data(get_page(title))
        print(data1)

Main()
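To make the failure mode concrete, here is a minimal, self-contained sketch (the markup and class name are made up, not taken from hipages) showing how positional indexing over same-class spans shifts when a field is missing:

```python
import re

# Hypothetical markup: every contact field shares the same class,
# so extraction relies purely on position.
full = ('<span class="Contact__Item">John Smith</span>'
        '<span class="Contact__Item">Sydney NSW</span>')
missing_person = '<span class="Contact__Item">Sydney NSW</span>'

def fields(html):
    # Grab the text of every same-class span, in document order.
    return re.findall(r'<span class="Contact__Item">(.*?)</span>', html)

print(fields(full)[0])            # person slot holds the person
print(fields(missing_person)[0])  # person slot now holds the address
```

When the person span is absent, index 0 silently becomes the address, which is exactly the misalignment described above.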
Answer

You need to parse your data from the script tag rather than the spans and divs.

Try this:

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
from pandas import json_normalize
import json

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'lxml')
        return soup

def get_detail_data(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "lxml")
    raw = res.text.split("<script> window.__INITIAL_STATE__=")[1]
    raw = raw.split("</script>")[0]
    data = json.loads(raw)   # first pass: unquote the JSON string
    data = json.loads(data)  # second pass: parse the JSON itself
    cols = ['abn', 'address', 'name', 'primary_location', 'service_area', 'state', 'suburb', 'website']
    df = pd.DataFrame(data["sites"]["list"]).T
    df = df[cols].reset_index(drop=True)
    primary_location = json_normalize(df.primary_location[0])
    df = pd.concat([df, primary_location], axis=1)
    to_drop = ["primary_location", "is_primary", "suburb_seo_key", "capital_city_seo_key"]
    df.drop(to_drop, axis=1, inplace=True)
    return df

def get_index_data(soup):
    titles = []
    for item in soup.findAll("h3", {'class': 'sc-bZQynM sc-iwsKbI dpKmnV'}):
        urls = f"https://hipages.com.au{item.previous_element.get('href')}"
        titles.append(urls)
    return titles

def Main():
    mainurl = "https://hipages.com.au/find/antenna_services/nsw/sydney"
    main_titles = get_index_data(get_page(mainurl))
    final_data = []
    for title in main_titles:
        data = get_detail_data(title)
        final_data.append(data)
    return final_data

data = Main()
df = pd.concat(data).reset_index(drop=True)
display(df)

This gives you much more detailed data by the way.
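For reference, the two json.loads calls are needed because the page embeds its state as a JSON-encoded string, so the first pass yields a str and the second yields the actual dict. A minimal sketch with a made-up payload (the real page's structure may differ):

```python
import json

# Hypothetical page source: the state is a *quoted* JSON string.
html = '<script> window.__INITIAL_STATE__="{\\"sites\\": {\\"list\\": {}}}"</script>'

raw = html.split("<script> window.__INITIAL_STATE__=")[1].split("</script>")[0]
state = json.loads(raw)   # still a string after the first pass
state = json.loads(state) # now a dict
print(state["sites"])
```

A single json.loads on such a page would return a plain string, and indexing it with ["sites"] would raise a TypeError.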

