Question 1

I am scraping this website and get "title" and "category" as text using .get_text().strip().

I have a problem using the same approach for extracting the "author" as text.

data2 = {
    'url': [],
    'title': [],
    'category': [],
    'author': [],
}

url_pattern = "https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&year=2018&page={}"
count_min = 1
count_max = 3

while count_min <= count_max:
    print(count_min)
    url = url_pattern.format(count_min)
    r = requests.get(url)
    try:
        soup = BeautifulSoup(r.content, 'lxml')
        for links in soup.find_all('article'):
            data2['url'].append(links.a.attrs['href'])
            data2['title'].append(links.h3.get_text().strip())
            data2["category"].append(links.span.get_text().strip())
            data2["author"].append(links.find('span', {"itemprop": "name"}).get_text().strip())  #??????
    except Exception as exc:
        print(exc.__class__.__name__, exc)
    time.sleep(0.1)
    count_min = count_min + 1

print ("Fertig.")
df = pd.DataFrame( data2 )
df

df is supposed to display a table with the columns "author", "category", "title" and "url". The except block prints the following hint: AttributeError 'NoneType' object has no attribute 'get_text'. But instead of the table I get the following message.

ValueError                                Traceback (most recent call last)
<ipython-input-34-9bfb92af1135> in <module>()
     29 
     30 print ("Fertig.")
---> 31 df = pd.DataFrame( data2 )
     32 df

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    328                                  dtype=dtype, copy=copy)
    329         elif isinstance(data, dict):
--> 330             mgr = self._init_dict(data, index, columns, dtype=dtype)
    331         elif isinstance(data, ma.MaskedArray):
    332             import numpy.ma.mrecords as mrecords

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    459             arrays = [data[k] for k in keys]
    460 
--> 461         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    462 
    463     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   6161     # figure out the index, if necessary
   6162     if index is None:
-> 6163         index = extract_index(arrays)
   6164     else:
   6165         index = _ensure_index(index)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in extract_index(data)
   6209             lengths = list(set(raw_lengths))
   6210             if len(lengths) > 1:
-> 6211                 raise ValueError('arrays must all be same length')
   6212 
   6213             if have_dicts:

ValueError: arrays must all be same length

How can I improve my code to get the "author" names extracted?

Answer

You're very close; there are a couple of things I'd recommend. First, take a closer look at the HTML: the author names are actually in a ul, where each li contains a span whose itemprop is 'name'. However, not all articles have author names. For those articles, with your current code, the call to links.find('span', {"itemprop": "name"}) returns None, and None has no attribute get_text. That line therefore raises an AttributeError after the url, title and category for that article have already been appended, so no value is appended to the data2 'author' list. Because the try wraps the whole for loop, the remaining articles on that page are skipped as well, and the lists end up with different lengths, which is what the ValueError from pandas is reporting. I'd recommend storing the author(s) in a list like so:

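authors = []
ul = links.find('ul', itemprop='creator')
if ul is not None:  # some articles have no author list at all
    for author in ul.find_all('span', itemprop='name'):
        authors.append(author.text.strip())
data2['author'].append(authors)  # 'author' is the key defined in data2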

This handles the case where there are no authors as we would expect: "authors" is simply an empty list, so every article still contributes exactly one entry to each column.
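As a self-contained check of this approach, here is a minimal sketch run against a small hand-written HTML snippet. The markup is an assumption modelled on the structure described above (a ul with itemprop="creator" containing span elements with itemprop="name"), not a copy of the real page, and extract_authors is a hypothetical helper name. It uses the built-in html.parser backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real nature.com page).
SAMPLE_HTML = """
<article>
  <h3><a href="/articles/a1">First title</a></h3>
  <ul itemprop="creator">
    <li><span itemprop="name">A. Author</span></li>
    <li><span itemprop="name"> B. Writer </span></li>
  </ul>
</article>
<article>
  <h3><a href="/articles/a2">Second title</a></h3>
</article>
"""


def extract_authors(article):
    """Return the author names of one <article> tag, or [] if there are none."""
    ul = article.find('ul', itemprop='creator')
    if ul is None:  # article without any author list
        return []
    return [span.get_text().strip() for span in ul.find_all('span', itemprop='name')]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
data = {'title': [], 'author': []}
for article in soup.find_all('article'):
    data['title'].append(article.h3.get_text().strip())
    data['author'].append(extract_authors(article))

print(data['author'])
```

Because every article appends exactly one entry to each list, the columns stay the same length, so passing the dictionary to pd.DataFrame should no longer raise the "arrays must all be same length" error.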

As a side note, putting your code inside a

try:
    ...
except:
    pass

construct is generally considered poor practice, for exactly the reason you're seeing now. Ignoring errors silently can give your program the appearance of running properly, while in fact any number of things could be going wrong. At the very least it's rarely a bad idea to print error info to stdout. Even just doing something like this is better than nothing:

try:
    ...
except Exception as exc:
    print(exc.__class__.__name__, exc)
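One way to apply this without losing rows is to put the try inside the loop and catch only the error you expect, appending a placeholder when it occurs. The sketch below uses plain dictionaries as stand-ins for the parsed articles (an assumption, so that it runs without network access); calling .strip() on a missing value fails the same way .get_text() does on None.

```python
# Stand-ins for parsed articles; the second one has no author (assumed data).
articles = [
    {'title': ' First ', 'author': ' A. Author '},
    {'title': ' Second ', 'author': None},
    {'title': ' Third ', 'author': ' B. Writer '},
]

data = {'title': [], 'author': []}
for article in articles:
    data['title'].append(article['title'].strip())
    try:
        author = article['author'].strip()  # AttributeError when author is None
    except AttributeError as exc:
        # Report the problem, then keep the row with a placeholder value.
        print(exc.__class__.__name__, exc)
        author = None
    data['author'].append(author)

print(data)
```

The error is still reported, but the loop continues with the next article and both lists keep the same length.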

For debugging, however, having the full traceback is often desirable as well. For this you can use the traceback module.

import traceback

try:
    ...
except:
    traceback.print_exc()
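If you would rather collect the traceback than print it immediately, traceback.format_exc() returns the same text as a string. A minimal sketch, where parse_author is a hypothetical function that fails the way the scraping code does:

```python
import traceback


def parse_author(tag):
    """Hypothetical parser: fails with AttributeError when tag is None."""
    return tag.strip()


errors = []
for tag in [' A. Author ', None]:
    try:
        parse_author(tag)
    except Exception:
        # format_exc() returns the full traceback of the exception being handled.
        errors.append(traceback.format_exc())

print(errors[0])
```

Collecting the tracebacks this way lets the loop finish and leaves the details available for inspection afterwards.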

Error extracting text from website: AttributeError NoneType object has no attribute get_text
