BeautifulSoup get text from tag searching by Title

2024/10/5 20:37:15

I'm scraping a webpage with Python that provides different documents, and I want to retrieve some information from them. The documents present the information in two ways. In the first, the label and the value are on the same line, like Company name: Company name, which is solved in this question. In the second, a label such as Title: is followed by the text in a separate block. Here's an example of this second HTML, starting at the div with class DocumentBody:

    <div class="DocumentBody">
      <div class="stdoc">...</div>
      <div class="stdoc">...</div>
      <div class="stdoc">...</div>
      <div class="grseq">...</div>
      <div class="grseq">
        <p class="tigrseq">...</p>
        <div class="mlioccur">...</div>
        <div class="mlioccur">
          <span class="nomark">...</span>
          <span class="timark">Denomination:</span>
          <div class="txtmark">
            <p class="p"> </p>
            <p>Information about denomination</p>
            <p></p>
          </div>
        </div>
      </div>
    </div>

At first I tried hardcoding the XPath to the text, but the problem is that the HTML of the documents might change; it is not always the same.

This is an example of what I made to get the denomination:

    from lxml import etree


    class LTED:
        def __init__(self, url, soup):
            if not soup:
                soup = get_soup_from_url(url, "html.parser")
            dom = etree.HTML(str(soup))
            # case: the document is an update (corrigendum), not a new one
            self.corrigenda = bool(soup.body.findAll(text="Corrigenda"))
            self.denomination = self.get_denomination(dom)

        def get_denomination(self, dom):
            if self.corrigenda:
                item = dom.xpath("//div[@class='DocumentBody']/div[7]/div[2]/div/p[2]")[0].text
            else:
                item = dom.xpath("//div[@class='DocumentBody']/div[6]/div[2]/div/p[2]")[0].text
            return item

As the xpath is hardcoded, this works the majority of the time, but in some cases it gets another text because the html has changed.

How should I retrieve the text in this case? Is there any way to get "Information about denomination" by searching for "Denomination:"?

In case you want to check the webpage, here's a link to an example I'm trying to scrape

Answer

The linked page does not contain a Denomination label, but you can adapt the following approach:

    for e in soup.select('span:-soup-contains("Title:") + div'):
        print(e.get_text(strip=True))
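Applied to the markup from the question, the same selector idea could look roughly like the sketch below. The HTML string is a trimmed copy of the question's snippet with placeholder content in the nomark span, and text_after_label is a hypothetical helper name, not part of BeautifulSoup. It assumes a Soup Sieve version recent enough to support :-soup-contains (2.1 or later).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; the "1" in span.nomark is a placeholder.
HTML = """
<div class="DocumentBody">
  <div class="grseq">
    <div class="mlioccur">
      <span class="nomark">1</span>
      <span class="timark">Denomination:</span>
      <div class="txtmark"><p class="p"> </p><p>Information about denomination</p><p></p></div>
    </div>
  </div>
</div>
"""


def text_after_label(soup, label):
    """Hypothetical helper: text of each div directly following a span containing `label`."""
    selector = f'span:-soup-contains("{label}") + div'
    return [div.get_text(strip=True) for div in soup.select(selector)]


soup = BeautifulSoup(HTML, "html.parser")
denominations = text_after_label(soup, "Denomination:")
print(denominations)
```

Because the lookup is anchored on the label text rather than on the position of the div, it does not depend on how many sibling blocks precede it.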

In newer code, avoid the old findAll() syntax; use find_all() instead, or select() with CSS selectors. For more, take a minute to check the docs.
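As a rough illustration of the newer method names without CSS selectors, the sketch below uses find() and find_all() with the string argument and then find_next_sibling(). The HTML string is a trimmed copy of the question's snippet with placeholder content, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; the "1" in span.nomark is a placeholder.
HTML = """
<div class="DocumentBody">
  <div class="mlioccur">
    <span class="nomark">1</span>
    <span class="timark">Denomination:</span>
    <div class="txtmark"><p class="p"> </p><p>Information about denomination</p><p></p></div>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find_all(string=...) replaces the older findAll(text=...) spelling.
is_corrigenda = bool(soup.find_all(string="Corrigenda"))

# Locate the label span by its exact text, then take the next <div> sibling.
label = soup.find("span", string="Denomination:")
block = label.find_next_sibling("div") if label else None
denomination = block.get_text(strip=True) if block else None

print(is_corrigenda, denomination)
```

Note that string="Denomination:" matches the exact text of the span, so a label with different spacing or wording would need a regular expression or the :-soup-contains selector instead.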

Example

    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(
        requests.get(
            'https://ted.europa.eu/udl?uri=TED:NOTICE:628602-2022:TEXT:EN:HTML&tabId=0',
            headers={'User-Agent': 'Mozilla/5.0'},
        ).text
    )

    for e in soup.select('div.txtmark:-soup-contains("Official name:")'):
        print(e.next.split(':')[-1].strip())

    for e in soup.select('span:-soup-contains("Title:") + div'):
        print(e.get_text(strip=True))

Output

KfW Bankengruppe
Vergabekammer Bund
Management of the PtX-Fund by the Power-to-X D&G GmbH (in formation)
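If you would rather stay with lxml as in the question, the same label-based lookup can probably be written in XPath instead of a positional path. This is a sketch of that alternative, not the approach used in the example above; get_text_by_label is a hypothetical helper name and the HTML string is a trimmed copy of the question's snippet with placeholder content.

```python
from lxml import etree

# Trimmed copy of the question's markup; the "1" in span.nomark is a placeholder.
HTML = """
<div class="DocumentBody">
  <div class="mlioccur">
    <span class="nomark">1</span>
    <span class="timark">Denomination:</span>
    <div class="txtmark"><p class="p"> </p><p>Information about denomination</p><p></p></div>
  </div>
</div>
"""


def get_text_by_label(dom, label):
    """Hypothetical helper: text of the first div following a span whose text equals `label`."""
    # $label is an XPath variable, filled in by the keyword argument below.
    xpath = "//span[normalize-space()=$label]/following-sibling::div[1]"
    blocks = dom.xpath(xpath, label=label)
    if not blocks:
        return None
    # Join all nested text nodes and collapse runs of whitespace.
    return " ".join("".join(blocks[0].itertext()).split())


dom = etree.HTML(HTML)
print(get_text_by_label(dom, "Denomination:"))
```

Returning None when the label is missing avoids the IndexError that the hardcoded [0] lookup would raise on a document with a different layout.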