Fetching lawyer details from a set of URLs using bs4 in Python

2024/7/6 19:41:15

I am an absolute beginner at web scraping with Python and know very little about programming in Python. I am just trying to extract the information of the lawyers in Tennessee. On the webpage there are multiple links, within which there are further links to the categories of lawyers, and within those are the lawyers' details.

I have already extracted the links to the various cities into a list, and have also extracted the various categories of lawyers available under each city's link. The profile links have also been fetched and stored as a set. Now I am trying to fetch each lawyer's name, address, firm name and practice area and store them in an .xls file.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

final = []
records = []
with requests.Session() as s:
    res = s.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
    soup = bs(res.content, 'lxml')
    cities = [item['href'] for item in soup.select('#browse_view a')]
    for c in cities:
        r = s.get(c)
        s1 = bs(r.content, 'lxml')
        categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
        for c1 in categories:
            r1 = s.get(c1)
            s2 = bs(r1.content, 'lxml')
            lawyers = [item['href'].split('*')[1] if '*' in item['href'] else item['href'] for item in s2.select('.indigo_text .directory_profile')]
            final.append(lawyers)
    final_list = {item for sublist in final for item in sublist}
    for i in final_list:
        r2 = s.get(i)
        s3 = bs(r2.content, 'lxml')
        name = s3.find('h2').text.strip()
        add = s3.find("div").text.strip()
        f_name = s3.find("a").text.strip()
        p_area = s3.find('ul', {"class": "basic_profile aag_data_value"}).find('li').text.strip()
        records.append({'Names': name, 'Address': add, 'Firm Name': f_name, 'Practice Area': p_area})
df = pd.DataFrame(records,columns=['Names','Address','Firm Name','Practice Areas'])
df=df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\lawyers.xls', sheet_name='MyData2', index = False, header=True)

I expected to get an .xls file, but nothing is produced while the execution goes on. It does not terminate until I force-stop it, and no .xls file is created.
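A crawl over every city, category and profile page can look frozen even while it is still working. A quick way to check is to print progress and cap the number of pages while testing; a minimal sketch, assuming final_list is the set of profile links built by the code above (the cap of 5 is arbitrary):

import itertools
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    # Visit only the first few profile links while debugging, printing each
    # URL so it is obvious whether the loop is still making progress.
    for i in itertools.islice(final_list, 5):
        print('fetching', i)
        r2 = s.get(i, headers={'User-agent': 'Super Bot 9000'})
        s3 = bs(r2.content, 'lxml')
        print(s3.title.text if s3.title else 'no <title> found')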

Answer

You need to extract those details by visiting each lawyer's page and using the appropriate selectors; the generic find('h2'), find('div') and find('a') calls in your script just return the first such tag on each profile page, which is unlikely to be the lawyer's details. Also note that the crawl visits every city, then every category, then every profile, so a full run makes a very large number of requests and simply takes a long time, which is why it appears to hang. Something like:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

records = []
final = []
with requests.Session() as s:
    res = s.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
    soup = bs(res.content, 'lxml')
    cities = [item['href'] for item in soup.select('#browse_view a')]
    for c in cities:
        r = s.get(c)
        s1 = bs(r.content, 'lxml')
        categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
        for c1 in categories:
            r1 = s.get(c1)
            s2 = bs(r1.content, 'lxml')
            lawyers = [item['href'].split('*')[1] if '*' in item['href'] else item['href'] for item in s2.select('.indigo_text .directory_profile')]
            final.append(lawyers)
    final_list = {item for sublist in final for item in sublist}
    for link in final_list:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        name = soup.select_one('#lawyer_name').text
        firm = soup.select_one('#firm_profile_page').text
        address = ' '.join([string for string in soup.select_one('#poap_postal_addr_block').stripped_strings][1:])
        practices = ' '.join([item.text for item in soup.select('#pa_list li')])
        row = [name, firm, address, practices]
        records.append(row)

df = pd.DataFrame(records, columns=['Name', 'Firm', 'Address', 'Practices'])
print(df)
df.to_csv(r'C:\Users\User\Desktop\Lawyers.csv', sep=',', encoding='utf-8-sig', index=False)
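If you specifically want the Excel output the question asked for, pandas can also write the frame with DataFrame.to_excel. A minimal sketch, assuming an Excel writer engine such as openpyxl is installed (pip install openpyxl); the path is only an example:

# Same DataFrame, written to an Excel workbook instead of CSV.
# Requires an Excel engine, e.g. pip install openpyxl
df.to_excel(r'C:\Users\User\Desktop\Lawyers.xlsx', sheet_name='MyData2', index=False)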
