Fetching lawyer details from a set of URLs using bs4 in Python

2024/7/6 19:41:15

I am an absolute beginner at web scraping with Python and know very little about programming in Python. I am just trying to extract the information of the lawyers in Tennessee. On the webpage there are multiple links, within which there are further links to the categories of lawyers, and within those are the lawyers' details.

I have already extracted the links to the various cities into a list, and have also extracted the various categories of lawyers available under each city's link. The profile links have also been fetched and stored as a set. Now I am trying to fetch each lawyer's name, address, firm name and practice area and store them in an .xls file.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

final = []
records = []
with requests.Session() as s:
    res = s.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
    soup = bs(res.content, 'lxml')
    cities = [item['href'] for item in soup.select('#browse_view a')]
    for c in cities:
        r = s.get(c)
        s1 = bs(r.content, 'lxml')
        categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
        for c1 in categories:
            r1 = s.get(c1)
            s2 = bs(r1.content, 'lxml')
            lawyers = [item['href'].split('*')[1] if '*' in item['href'] else item['href'] for item in s2.select('.indigo_text .directory_profile')]
            final.append(lawyers)
    final_list = {item for sublist in final for item in sublist}
    for i in final_list:
        r2 = s.get(i)
        s3 = bs(r2.content, 'lxml')
        name = s3.find('h2').text.strip()
        add = s3.find("div").text.strip()
        f_name = s3.find("a").text.strip()
        p_area = s3.find('ul', {"class": "basic_profile aag_data_value"}).find('li').text.strip()
        records.append({'Names': name, 'Address': add, 'Firm Name': f_name, 'Practice Area': p_area})
df = pd.DataFrame(records,columns=['Names','Address','Firm Name','Practice Areas'])
df=df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\lawyers.xls', sheet_name='MyData2', index = False, header=True)

I expected to get an .xls file, but nothing is produced while the execution goes on. It does not terminate until I force-stop it, and no .xls file is created.
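A crawl over every city, category and profile page can look frozen even while it is still working. A quick way to check is to print progress and cap the number of pages while testing; a minimal sketch, assuming final_list is the set of profile links built by the code above (the cap of 5 is arbitrary):

import itertools
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    # Visit only the first few profile links while debugging, printing each
    # URL so it is obvious whether the loop is still making progress.
    for i in itertools.islice(final_list, 5):
        print('fetching', i)
        r2 = s.get(i, headers={'User-agent': 'Super Bot 9000'})
        s3 = bs(r2.content, 'lxml')
        print(s3.title.text if s3.title else 'no <title> found')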

Answer

You need to extract those details by visiting each lawyer's page and using the appropriate selectors; the generic find('h2'), find('div') and find('a') calls in your script just return the first such tag on each profile page, which is unlikely to be the lawyer's details. Also note that the crawl visits every city, then every category, then every profile, so a full run makes a very large number of requests and simply takes a long time, which is why it appears to hang. Something like:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

records = []
final = []
with requests.Session() as s:
    res = s.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
    soup = bs(res.content, 'lxml')
    cities = [item['href'] for item in soup.select('#browse_view a')]
    for c in cities:
        r = s.get(c)
        s1 = bs(r.content, 'lxml')
        categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
        for c1 in categories:
            r1 = s.get(c1)
            s2 = bs(r1.content, 'lxml')
            lawyers = [item['href'].split('*')[1] if '*' in item['href'] else item['href'] for item in s2.select('.indigo_text .directory_profile')]
            final.append(lawyers)
    final_list = {item for sublist in final for item in sublist}
    for link in final_list:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        name = soup.select_one('#lawyer_name').text
        firm = soup.select_one('#firm_profile_page').text
        address = ' '.join([string for string in soup.select_one('#poap_postal_addr_block').stripped_strings][1:])
        practices = ' '.join([item.text for item in soup.select('#pa_list li')])
        row = [name, firm, address, practices]
        records.append(row)

df = pd.DataFrame(records, columns=['Name', 'Firm', 'Address', 'Practices'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\Lawyers.csv', sep=',', encoding='utf-8-sig', index=False)
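If you specifically want the Excel output the question asked for, pandas can also write the frame with DataFrame.to_excel. A minimal sketch, assuming an Excel writer engine such as openpyxl is installed (pip install openpyxl); the path is only an example:

# Same DataFrame, written to an Excel workbook instead of CSV.
# Requires an Excel engine, e.g. pip install openpyxl
df.to_excel(r'C:\Users\User\Desktop\Lawyers.xlsx', sheet_name='MyData2', index=False)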
