Python: How to access and iterate over a list of div class elements using BeautifulSoup

2024/10/14 9:26:48

I'm parsing data about car production with BeautifulSoup (see also my first question):

from bs4 import BeautifulSoup
import string

html = """
<h4>Production Capacity (year)</h4><div class="profile-area">Vehicle 1,140,000 units /year</div>
<h4>Output</h4><div class="profile-area">Vehicle 809,000 units ( 2016 ) </div><div class="profile-area">Vehicle 815,000 units ( 2015 ) </div><div class="profile-area">Vehicle 836,000 units ( 2014 ) </div><div class="profile-area">Vehicle 807,000 units ( 2013 ) </div><div class="profile-area">Vehicle 760,000 units ( 2012 ) </div><div class="profile-area">Vehicle 805,000 units ( 2011 ) </div>
"""
soup = BeautifulSoup(html, 'lxml')
for item in soup.select("div.profile-area"):
    produkz = item.text.strip()
    produkz = produkz.replace('\n',':')
    prev_h4 = str(item.find_previous_sibling('h4'))
    if "Models" in prev_h4:
        models=produkz
    else:
        models=""
    if "Capacity" in prev_h4:
        capacity=produkz
    else:
        capacity=""
    if "( 2015 )" in produkz:
        prod15=produkz
    else:
        prod15=""
    if "( 2016 )" in produkz:
        prod16=produkz
    else:
        prod16=""
    if "( 2017 )" in produkz:
        prod17=produkz
    else:
        prod17=""
    print(models+';'+capacity+';'+prod15+';'+prod16+';'+prod17)

My problem is that each pass of the loop over the matching HTML elements ("div.profile-area") overwrites the values from the previous pass, so every element ends up on its own line:

;Vehicle 1,140,000 units /year;;;;;;
;;;;;;Vehicle 809,000 units ( 2016 );
;;;;;Vehicle 815,000 units ( 2015 );;
;;;;Vehicle 836,000 units ( 2014 );;;
;;;Vehicle 807,000 units ( 2013 );;;;
;;Vehicle 760,000 units ( 2012 );;;;;
;;;;;;;

My desired result is:

;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 );

I would be glad if you could show me a better way to structure my code. Thanks in advance.

Answer

This is my solution. You need to handle each element tag and parse it accordingly. I went a bit beyond your question and offer a more flexible way to access each data value. Hope it helps.

import re
from bs4 import BeautifulSoup

html_doc = """
<h4>Production Capacity (year)</h4><div class="profile-area">Vehicle 1,140,000 units /year</div>
<h4>Output</h4><div class="profile-area">Vehicle 809,000 units ( 2016 ) </div><div class="profile-area">Vehicle 815,000 units ( 2015 ) </div><div class="profile-area">Vehicle 836,000 units ( 2014 ) </div><div class="profile-area">Vehicle 807,000 units ( 2013 ) </div><div class="profile-area">Vehicle 760,000 units ( 2012 ) </div><div class="profile-area">Vehicle 805,000 units ( 2011 ) </div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
h4_elements = soup.find_all('h4')
profile_areas = soup.find_all('div', attrs={'class': 'profile-area'})
print('\n')
print("++++++++++++++++++++++++++++++++++++")
print("Element counts")
print("++++++++++++++++++++++++++++++++++++")
print("Total H4: {}".format(len(h4_elements)))
print("++++++++++++++++++++++++++++++++++++")
print("Total profile-area: {}".format(len(profile_areas)))
print("++++++++++++++++++++++++++++++++++++")
print('\n')
for i in h4_elements:
    print("++++++++++++++++++++++++++++++++++++")
    print(i.text.rstrip().lstrip())
    print("++++++++++++++++++++++++++++++++++++")
    del profile_areas[0]
    for j in profile_areas:
        raw = re.sub('[^A-Za-z0-9]+', ' ', j.text.replace(',','').lstrip().rstrip())
        raw = raw.rstrip()
        el = raw.split(' ')
        print('Type: {} '.format(el[0]))
        print('Sold: {} {} '.format(el[1], el[2]))
        print('Year: {} '.format(el[3]))
        print("++++++++++++++++++++++++++++++++++++")

The output is the following:

 ++++++++++++++++++++++++++++++++++++
Production Capacity (year)
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 809000 units 
Year: 2016 
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 815000 units 
Year: 2015 
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 836000 units 
Year: 2014 
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 807000 units 
Year: 2013 
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 760000 units 
Year: 2012 
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 805000 units 
Year: 2011 
++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++
Output
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 815000 units 
Year: 2015 
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 836000 units 
Year: 2014 
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 807000 units 
Year: 2013 
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 760000 units 
Year: 2012 
++++++++++++++++++++++++++++++++++++
Type:Vehicle 
Sold: 805000 units 
Year: 2011 
++++++++++++++++++++++++++++++++++++
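
In the listing above the same Output records appear under both headings, because the divs are not tied to the heading they follow. As a rough sketch of one way to get the single semicolon-separated line asked for in the question, each div can be matched to its preceding h4 with find_previous_sibling, the values collected first, and the row printed once after the loop. The function name build_row, the years argument and the multi-line layout of the sample HTML are assumptions made for this illustration; the "Models" field stays empty because the sample HTML has no such heading.

```python
import re
from bs4 import BeautifulSoup

# Same data as in the question, laid out one element per line for readability.
html_doc = """
<h4>Production Capacity (year)</h4>
<div class="profile-area">Vehicle 1,140,000 units /year</div>
<h4>Output</h4>
<div class="profile-area">Vehicle 809,000 units ( 2016 ) </div>
<div class="profile-area">Vehicle 815,000 units ( 2015 ) </div>
<div class="profile-area">Vehicle 836,000 units ( 2014 ) </div>
<div class="profile-area">Vehicle 807,000 units ( 2013 ) </div>
<div class="profile-area">Vehicle 760,000 units ( 2012 ) </div>
<div class="profile-area">Vehicle 805,000 units ( 2011 ) </div>
"""


def build_row(html, years):
    """Collect all values first, then build one ';'-separated row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    models = ""           # no "Models" heading in the sample, so this stays empty
    capacity = ""
    output_by_year = {}   # maps a year string such as "2016" to the div text

    for div in soup.find_all("div", class_="profile-area"):
        text = div.get_text(strip=True)
        # The nearest preceding <h4> tells us which section this div belongs to.
        heading = div.find_previous_sibling("h4")
        heading_text = heading.get_text(strip=True) if heading else ""

        if "Models" in heading_text:
            models = text
        elif "Capacity" in heading_text:
            capacity = text
        elif "Output" in heading_text:
            # Pull the four-digit year out of "( 2016 )".
            match = re.search(r"\(\s*(\d{4})\s*\)", text)
            if match:
                output_by_year[match.group(1)] = text

    # Build the row once, after the loop, so nothing gets overwritten.
    fields = [models, capacity] + [output_by_year.get(y, "") for y in years]
    return ";".join(fields) + ";"


print(build_row(html_doc, ["2012", "2013", "2014", "2015", "2016"]))
```

Keeping the per-year values in a dictionary means the order of the divs in the page no longer matters, and a year missing from the page simply yields an empty field.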