Identifying large bodies of text via BeautifulSoup or other Python-based extractors

2024/10/15 17:24:45

Given a random news article, I want to write a web crawler that finds the largest body of text on the page and extracts it. The intention is to extract the actual news article text.

The original plan was to use BeautifulSoup's findAll(True) command (which extracts all HTML tags) and to sort each tag by the length of its .getText() value. EDIT: don't use BeautifulSoup for this kind of HTML work; use the lxml library instead — it has Python bindings and is much faster than BeautifulSoup.
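As a sketch of that original plan (using BeautifulSoup with a toy HTML snippet — the sample markup and helper name are made up for illustration): ranking every tag by its full .getText() always crowns the root <html> tag, since it contains all text, so this version ranks tags by their *direct* text children instead:

```python
from bs4 import BeautifulSoup

html = ("<html><body><div id='nav'><a href='/'>Home</a></div>"
        "<div id='story'><p>A long paragraph holding the article text.</p></div>"
        "</body></html>")
soup = BeautifulSoup(html, "html.parser")

# find_all(True) returns every tag; ranking by the full get_text()
# always picks <html>, so rank by each tag's direct strings instead.
def direct_text_len(tag):
    return sum(len(s) for s in tag.find_all(string=True, recursive=False))

largest = max(soup.find_all(True), key=direct_text_len)
print(largest.name)  # the <p> tag wins here
```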

But this won't work for most pages, like the one I listed as an example, because the large body of text is split into many smaller tags, like paragraph dividers for example.
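One stdlib-only way around the split-tag problem is to aggregate text length across all descendants and then pick the *deepest* container that still holds most of the page's text. A rough sketch (the class name, toy HTML, and 50% threshold are all assumptions, not a complete solution — it doesn't handle void tags like <br>):

```python
from html.parser import HTMLParser

class DensestBlockFinder(HTMLParser):
    """Accumulate descendant text length per element, then pick the
    deepest tag that still contains most of the page's text."""
    def __init__(self):
        super().__init__()
        self.stack = []       # open elements: [tag_name, text_total]
        self.totals = []      # (depth, tag_name, text_total) for closed elements
        self.page_total = 0

    def handle_starttag(self, tag, attrs):
        self.stack.append([tag, 0])

    def handle_data(self, data):
        n = len(data.strip())
        self.page_total += n
        for frame in self.stack:   # text counts toward every open ancestor
            frame[1] += n

    def handle_endtag(self, tag):
        # pop until the matching tag to tolerate sloppy nesting
        while self.stack:
            name, total = self.stack.pop()
            self.totals.append((len(self.stack), name, total))
            if name == tag:
                break

    def densest(self, fraction=0.5):
        # deepest element holding at least `fraction` of all page text
        candidates = [t for t in self.totals
                      if t[2] >= fraction * self.page_total]
        return max(candidates, key=lambda t: t[0]) if candidates else None

html = ("<html><body><div id='nav'>Home About</div>"
        "<div id='story'><p>First long paragraph of the article.</p>"
        "<p>Second long paragraph of the article.</p></div>"
        "</body></html>")
finder = DensestBlockFinder()
finder.feed(html)
depth, tag, total = finder.densest()
print(tag)  # the story <div>, even though its text is split across <p> tags
```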

Does anyone have any experience with this? Any help with something like this would be amazing.

At the moment I'm using BeautifulSoup with Python, but I'm willing to explore other possibilities.


EDIT: Came back to this question a few months later (wow, I sounded like an idiot ^), and solved this with a combination of libraries and my own code.

Here are some extremely helpful Python libraries for the task, sorted by how much each helped me:

#1 goose library — fast, powerful, consistent.
#2 readability library — content quality is passable; slower on average than goose but faster than boilerpipe.
#3 python-boilerpipe — slower and hard to install. No fault of the boilerpipe library itself (originally written in Java); the problem is that this wrapper is built on top of a Java library, which adds IO time, errors, etc.

I'll release benchmarks perhaps if there is interest.
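For anyone who wants to run such benchmarks themselves, a minimal timing harness might look like the sketch below. The extractor callables and pages are placeholders — you would plug in goose, readability, or boilerpipe yourself (the commented-out line shows a hypothetical goose hookup):

```python
import time

def benchmark(extractor, html_pages, repeats=3):
    """Average wall-clock time for one pass of `extractor` over all pages."""
    start = time.perf_counter()
    for _ in range(repeats):
        for page in html_pages:
            extractor(page)
    return (time.perf_counter() - start) / repeats

# Hypothetical hookup for a real extractor:
# elapsed = benchmark(lambda h: Goose().extract(raw_html=h).cleaned_text, pages)

# Dummy extractor so the harness itself is runnable as-is:
dummy = lambda html_str: html_str.strip()
elapsed = benchmark(dummy, ["<p>example</p>"] * 10)
```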


Indirectly related libraries, you should probably install them and read their docs:

  • NLTK text processing library This is too good not to install. It provides text-analysis tools alongside HTML tools (cleanup, etc.).
  • lxml html/xml parser Mentioned above. It beats BeautifulSoup in every aspect but usability. It's a bit harder to learn, but the results are worth it: HTML parsing takes much less time, and the difference is very noticeable.
  • python webscraper library I think the value of this code isn't the library itself, but using it as a reference for building your own crawlers/extractors. It's very nicely coded and documented!
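For a quick taste of the lxml API mentioned above (assuming lxml is installed; the sample HTML is made up), extracting all the text under an element is a one-liner:

```python
from lxml import html

# fromstring() returns the root element of the fragment (here, the <div>)
doc = html.fromstring("<div><p>Hello <b>brave</b> world</p></div>")

# text_content() concatenates all descendant text, like get_text() in bs4
print(doc.text_content())  # Hello brave world
```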

A lot of the value and power of using Python, a rather slow language, comes from its open source libraries. They are especially powerful when combined, and everyone should take advantage of them to solve whatever problems they may have!

The goose library gets lots of solid maintenance; they just added Arabic support. It's great!

Answer

You might look at the python-readability package, which does exactly this for you.
