Extract text between two different tags with Beautiful Soup

2024/10/9 12:33:39

I'm trying to extract the text content of the article from this web page.

I only want the article content, not the "About the author" part.

The problem is that the article content isn't wrapped in a tag of its own, such as a <div>; it all sits in <p> tags. When I extract all the <p> tags, I also get the "About the author" part. I have to scrape many pages from this website. Is there a way to do this using Beautiful Soup?

I'm currently trying:

p_tags = soup.find_all('p')
for row in p_tags:
    print(row)

Answer

All the paragraphs that you want are located inside the <div class="td-post-content"> tag, along with the paragraphs for the author information. However, the <p> tags you need are direct children of this <div>, while the unwanted <p> tags are not direct children (they are nested inside other <div> tags).

So you can pass recursive=False to find_all() to match only the direct children.

Code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
r = requests.get('https://www.the-blockchain.com/2018/06/29/mcafee-labs-report-6x-increase-in-crypto-mining-malware-incidents-in-q1-2018/', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
container = soup.find('div', class_='td-post-content')
for para in container.find_all('p', recursive=False):
    print(para.text)

Output:

Cybersecurity giant McAfee released its McAfee Labs Threat Report: June 2018 on Wednesday, outlining the growth and trends of new malware and cyber threats in Q1 2018. According to the report, coin mining malware saw a 623 percent growth in the first quarter of 2018, infecting 2.9 million machines in that period. McAfee Labs counted 313 publicly disclosed security incidents in the first three months of 2018, a 41 percent increase over the previous quarter. In particular, incidents in the healthcare sector rose 57 percent, with a significant portion involving Bitcoin-based ransomware that healthcare institutions were often compelled to pay.
Chief Scientist at McAfee Raj Samani said, “There were new revelations this quarter concerning complex nation-state cyber-attack campaigns targeting users and enterprise systems worldwide. Bad actors demonstrated a remarkable level of technical agility and innovation in tools and tactics. Criminals continued to adopt cryptocurrency mining to easily monetize their criminal activity.”
Sizeable criminal organizations are responsible for many of the attacks in recent months. In January, malware dubbed Golden Dragon attacked organizations putting together the Pyeongchang Winter Olympics in South Korea, using a malicious word attachment to install a script that would encrypt and send stolen data to an attacker’s command center. The Lazarus cybercrime ring launched a highly sophisticated Bitcoin phishing campaign called HaoBao that targeted global financial organizations, sending an email attachment that would scan for Bitcoin activity, credentials and mining data.
Chief Technology Officer at McAfee Steve Grobman said, “Cybercriminals will gravitate to criminal activity that maximizes their profit. In recent quarters we have seen a shift to ransomware from data-theft,  as ransomware is a more efficient crime. With the rise in value of cryptocurrencies, the market forces are driving criminals to crypto-jacking and the theft of cryptocurrency. Cybercrime is a business, and market forces will continue to shape where adversaries focus their efforts.”
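
Since the question mentions scraping many pages, the same idea could be wrapped in a small helper. The following is only a sketch: the function name article_paragraphs and the sample markup (including the author-box class) are made up for illustration, and it assumes every page uses the same td-post-content container. It uses the built-in html.parser so it can be run offline, without a request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the layout described in the answer:
# article paragraphs are direct children of the content div, while the
# author paragraph is nested one level deeper.
SAMPLE_HTML = """
<div class="td-post-content">
  <p>First article paragraph.</p>
  <p>Second article paragraph.</p>
  <div class="author-box">
    <p>About the author</p>
  </div>
</div>
"""


def article_paragraphs(html):
    """Return the text of <p> tags that are direct children of the content div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="td-post-content")
    if container is None:
        # The page does not follow the assumed layout.
        return []
    # recursive=False restricts the search to direct children only.
    return [p.get_text(strip=True) for p in container.find_all("p", recursive=False)]


# Direct children only: the nested author paragraph is skipped.
paragraphs = article_paragraphs(SAMPLE_HTML)

# Default (recursive) search for comparison: the author paragraph is included.
all_paragraphs = [
    p.get_text(strip=True)
    for p in BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("p")
]

for text in paragraphs:
    print(text)
```

For a real page, r.text from the requests call above would be passed in place of SAMPLE_HTML. Returning an empty list when the container is missing is one possible choice; raising an error may suit a batch scraper better.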