Error extracting text from website: AttributeError: 'NoneType' object has no attribute 'get_text'

2024/7/7 6:22:39

I am scraping this website and getting the "title" and "category" as text using .get_text().strip().

I have a problem using the same approach to extract the "author" as text.

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data2 = {
    'url': [],
    'title': [],
    'category': [],
    'author': [],
}
url_pattern = "https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&year=2018&page={}"
count_min = 1
count_max = 3
while count_min <= count_max:
    print(count_min)
    url = url_pattern.format(count_min)
    r = requests.get(url)
    try:
        soup = BeautifulSoup(r.content, 'lxml')
        for links in soup.find_all('article'):
            data2['url'].append(links.a.attrs['href'])
            data2['title'].append(links.h3.get_text().strip())
            data2["category"].append(links.span.get_text().strip())
            data2["author"].append(links.find('span', {"itemprop": "name"}).get_text().strip())  #??????
    except Exception as exc:
        print(exc.__class__.__name__, exc)
    time.sleep(0.1)
    count_min = count_min + 1
print("Fertig.")
df = pd.DataFrame(data2)
df

df is supposed to display a table with the columns "author", "category", "title", and "url". The printed exception gives me the following hint: AttributeError 'NoneType' object has no attribute 'get_text'. But instead of the table I get the following message:

ValueError                                Traceback (most recent call last)
<ipython-input-34-9bfb92af1135> in <module>()
     29 
     30 print ("Fertig.")
---> 31 df = pd.DataFrame( data2 )
     32 df

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    328                                  dtype=dtype, copy=copy)
    329         elif isinstance(data, dict):
--> 330             mgr = self._init_dict(data, index, columns, dtype=dtype)
    331         elif isinstance(data, ma.MaskedArray):
    332             import numpy.ma.mrecords as mrecords

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    459             arrays = [data[k] for k in keys]
    460 
--> 461         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    462 
    463     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   6161     # figure out the index, if necessary
   6162     if index is None:
-> 6163         index = extract_index(arrays)
   6164     else:
   6165         index = _ensure_index(index)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in extract_index(data)
   6209             lengths = list(set(raw_lengths))
   6210             if len(lengths) > 1:
-> 6211                 raise ValueError('arrays must all be same length')
   6212 
   6213             if have_dicts:

ValueError: arrays must all be same length

How can I improve my code to get the "author" names extracted?

Answer

You're very close; there are a couple of things I recommend. First, take a closer look at the HTML: the author names are actually in a ul, where each li contains a span whose itemprop is 'name'. However, not every article has author names. For those articles, the call to links.find('span', {'itemprop': 'name'}) in your current code returns None, and None, of course, has no attribute get_text. That line therefore raises an AttributeError. By then the url, title, and category of that article have already been appended, so the 'author' list ends up shorter than the others, which is why pd.DataFrame later complains that the arrays must all be the same length. I'd recommend storing the author(s) in a list like so:

authors = []
ul = links.find('ul', itemprop='creator')
if ul is not None:  # some articles may have no author list at all
    for author in ul.find_all('span', itemprop='name'):
        authors.append(author.text.strip())
data2['author'].append(authors)

This handles the case where there are no authors as we would expect: "authors" is simply an empty list, so the 'author' column still gets exactly one entry per article.
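To try the list-of-authors approach without hitting the network, here is a minimal sketch that runs it against a small hand-written HTML fragment. The fragment and the helper name extract_authors are assumptions made for illustration, and the real page's markup may differ. The sketch also checks whether the ul exists at all before searching it, and it uses the built-in 'html.parser' rather than 'lxml' so that only BeautifulSoup is required.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real page):
# the second article has no author list at all.
SAMPLE_HTML = """
<article>
  <h3><a href="/articles/a1">First title</a></h3>
  <ul itemprop="creator">
    <li><span itemprop="name"> Ada Lovelace </span></li>
    <li><span itemprop="name">Alan Turing</span></li>
  </ul>
</article>
<article>
  <h3><a href="/articles/a2">Second title</a></h3>
</article>
"""


def extract_authors(article):
    """Return the author names of one <article> tag, or [] if there are none."""
    ul = article.find('ul', itemprop='creator')
    if ul is None:  # no author list on this article
        return []
    return [span.get_text().strip() for span in ul.find_all('span', itemprop='name')]


data2 = {'url': [], 'title': [], 'author': []}
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
for links in soup.find_all('article'):
    data2['url'].append(links.a.attrs['href'])
    data2['title'].append(links.h3.get_text().strip())
    data2['author'].append(extract_authors(links))

# Every list now has one entry per article, which is what pd.DataFrame needs.
print({key: len(values) for key, values in data2.items()})
print(data2['author'])
```

Because every article contributes exactly one value to each list, the lists stay the same length even when an article has no authors.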

As a side note, putting your code inside a

try:
    ...
except:
    pass

construct is generally considered poor practice, for exactly the reason you're seeing now. Ignoring errors silently can give your program the appearance of running properly, while in fact any number of things could be going wrong. At the very least it's rarely a bad idea to print error info to stdout. Even just doing something like this is better than nothing:

try:
    ...
except Exception as exc:
    print(exc.__class__.__name__, exc)

For debugging, however, having the full traceback is often desirable as well. For this you can use the traceback module.

import traceback

try:
    ...
except:
    traceback.print_exc()
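
To compare the short summary with the full traceback, the sketch below reproduces the same AttributeError without any scraping by calling .get_text() on None. The helper name author_text is hypothetical. It uses traceback.format_exc(), which returns as a string the same text that traceback.print_exc() writes to stderr, so the result can be printed or logged wherever you like.

```python
import traceback


def author_text(node):
    """Mimic `links.find(...).get_text().strip()`; `node` may be None."""
    return node.get_text().strip()


try:
    author_text(None)  # what happens when find() matched nothing
except Exception as exc:
    # Short form: exception class and message only.
    summary = "{} {}".format(exc.__class__.__name__, exc)
    # Long form: the full traceback as a string.
    details = traceback.format_exc()

print(summary)
print(details)
```

The short form tells you what went wrong, while the traceback also shows which line and which call raised the error.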