How to retrieve nested data with BeautifulSoup?

2024/10/5 17:21:03

I have the below webpage source:

</li><li class="cl-static-search-result" title="BELLO HONDA ACCORD &quot;95 MIL MILLAS&quot;. REALMENTE COMO NUEVO"><a href="link1"><div class="title">BELLO HONDA ACCORD &quot;95 MIL MILLAS&quot;. REALMENTE COMO NUEVO</div><div class="details"><div class="price">$4,600</div><div class="location">Miami</div></div></a></li><li class="cl-static-search-result" title="Honda Element"><a href=" link2 "><div class="title">Honda Element</div><div class="details"><div class="price">$4,950</div><div class="location">Coral springs</div></div></a></li><li class="cl-static-search-result" title="Mint Jeep"><a href=" link3 "><div class="title">Mint Jeep</div><div class="details"><div class="price">$8,500</div><div class="location">Pompano</div></div></a></li>

I need to extract the data as below:

| URL  | TITLE               | PRICE  |
| ---- | ------------------- | ------ |
| link1 | BELLO HONDA ACCORD | $4,600 |
| link2 | Honda Element      | $4,950 |
| link3 | Mint Jeep          | $8,500 |

I am able to extract the URLs. When I attempt to get the title and price, I seem to end up in a loop that gets the title/price for the full page after each URL I extract. Below is my code:

from urllib import request 
from bs4 import BeautifulSoup
from lxml import etree
import csv
page_url = 'URLNAME'
rawpage = request.urlopen(page_url)
soup = BeautifulSoup(rawpage, 'html5lib')
links_list = []
for link in soup.find_all('a'):
    try:
        url = link.get('href')
        for div in soup.find_all('div', attrs={'class':'title'}):
            title = div.text
            print (title)
        links_list.append({'url': url})
    # if the row is missing anything...
    except AttributeError:
        # ...skip it, don't blow up.
        pass
# save it to csv
with open('links.csv', 'w', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    # Create the header row
    csv_writer.writerow(['url', 'title'])
    for row in links_list:
        csv_writer.writerow([str(row['url'])])

Answer

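Change your strategy for selecting and iterating over the elements. Your inner loop calls soup.find_all() on the whole document, so it returns every title on the page for each link. Instead, iterate over each li result and read the link, title and price from within that element, for example with CSS selectors: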

...
data = []
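# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('li[title]'):
    data.append({
        'link': e.a.get('href').strip(),  # some hrefs are padded with spaces
        'title': e.get('title'),
        'price': e.select_one('.price').get_text(),
    })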
data

Then process the list of dicts to write your file, create a DataFrame, and so on.
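
For the file-writing step, one possible approach is csv.DictWriter from the standard library. The sketch below is only an illustration: the data list is hard-coded to mirror the dicts produced for the sample HTML, write_rows is a hypothetical helper name, and the column order is an assumption. It writes to an in-memory buffer so it runs on its own.

```python
import csv
import io

# Rows shaped like the dicts collected from the page (values copied from
# the sample HTML and hard-coded so this sketch is self-contained).
data = [
    {'link': 'link1', 'title': 'BELLO HONDA ACCORD "95 MIL MILLAS". REALMENTE COMO NUEVO', 'price': '$4,600'},
    {'link': 'link2', 'title': 'Honda Element', 'price': '$4,950'},
    {'link': 'link3', 'title': 'Mint Jeep', 'price': '$8,500'},
]

def write_rows(rows, csv_out):
    """Write a list of dicts to an open text file, header row first."""
    writer = csv.DictWriter(csv_out, fieldnames=['link', 'title', 'price'])
    writer.writeheader()
    writer.writerows(rows)

# In-memory demo. For a real file, something like:
#   with open('links.csv', 'w', newline='') as f:
#       write_rows(data, f)
buffer = io.StringIO()
write_rows(data, buffer)
print(buffer.getvalue())
```

Because the prices contain commas, the csv module quotes those fields automatically, so the columns stay aligned when the file is read back.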

Example

from bs4 import BeautifulSoup
html = '''
<li class="cl-static-search-result" title="BELLO HONDA ACCORD &quot;95 MIL MILLAS&quot;. REALMENTE COMO NUEVO"><a href="link1"><div class="title">BELLO HONDA ACCORD &quot;95 MIL MILLAS&quot;. REALMENTE COMO NUEVO</div><div class="details"><div class="price">$4,600</div><div class="location">Miami</div></div></a></li><li class="cl-static-search-result" title="Honda Element"><a href=" link2 "><div class="title">Honda Element</div><div class="details"><div class="price">$4,950</div><div class="location">Coral springs</div></div></a></li><li class="cl-static-search-result" title="Mint Jeep"><a href=" link3 "><div class="title">Mint Jeep</div><div class="details"><div class="price">$8,500</div><div class="location">Pompano</div></div></a></li>
'''
data = []
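# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('li[title]'):
    data.append({
        'link': e.a.get('href').strip(),  # some hrefs are padded with spaces
        'title': e.get('title'),
        'price': e.select_one('.price').get_text(),
    })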
data
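
The code in the question wraps its loop in try/except AttributeError to skip rows that are missing something. A per-element lookup can handle that case without an exception, by checking for None before reading a value. The following is a minimal sketch under assumptions: the sample markup is a shortened, hypothetical variant of the page in which the second result has no price, and text_or_none and parse_results are illustrative helper names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second result deliberately lacks a price element.
html = '''
<li class="cl-static-search-result" title="Honda Element"><a href=" link2 "><div class="title">Honda Element</div><div class="details"><div class="price">$4,950</div><div class="location">Coral springs</div></div></a></li>
<li class="cl-static-search-result" title="Mint Jeep"><a href=" link3 "><div class="title">Mint Jeep</div><div class="details"><div class="location">Pompano</div></div></a></li>
'''

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

def parse_results(markup):
    """Collect link, title and price per result, tolerating missing parts."""
    soup = BeautifulSoup(markup, 'html.parser')
    rows = []
    for li in soup.select('li.cl-static-search-result'):
        link = li.select_one('a[href]')  # None when the result has no link
        rows.append({
            'link': link['href'].strip() if link is not None else None,
            'title': li.get('title'),
            'price': text_or_none(li.select_one('.price')),
        })
    return rows

rows = parse_results(html)
print(rows)
```

Rows with a missing field keep their place in the output with None in that column, which makes gaps visible instead of silently dropping the listing.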