BeautifulSoup Scraping Results not showing

2024/11/18 10:52:59

I am playing around with BeautifulSoup to scrape data from websites. So I decided to scrape empireonline's website for 100 greatest movies of all time.

Here's the link to the webpage: https://www.empireonline.com/movies/features/best-movies-2/

I fetched the HTML from the site without any problem and was able to parse it with BeautifulSoup. But when I tried to get the list of the 100 movie titles, I got an empty list. Here's the code I wrote:

import requests
from bs4 import BeautifulSoup

URL = "https://www.empireonline.com/movies/features/best-movies-2/"

response = requests.get(URL)
top100_webpage = response.text
soup = BeautifulSoup(top100_webpage, "html.parser")
movies = soup.find_all(name="h3", class_="jsx-4245974604")
print(movies)

When I ran the code, the result was an empty list. I changed my parsing library to lxml and html5lib but I was still getting the same empty list.

Please how can I resolve this issue?

Answer

The data you need is rendered dynamically, so it is not in the HTML elements that requests receives. However, it is also stored in the page as inline JSON, so we can extract it from there with a regular expression. To do that, open the page source (Ctrl+U), search for the values you need, and if they are there, pull them out with regular expressions.
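The manual Ctrl+U check can also be done in code. This is a minimal sketch using only the standard library; `SAMPLE_HTML` is a hypothetical stand-in for `response.text`, and `scripts_containing` is a helper name made up for this example. If a known title shows up inside a `<script>` tag but not in any visible element, the data is most likely inline JSON.

```python
import re

# Hypothetical stand-in for response.text; the real page is far larger.
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script>window.__DATA__ = {"titleText":"100) Reservoir Dogs"};</script>
</body></html>
"""

# Rough pattern for <script> bodies; good enough for a quick check, not a full HTML parser.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)


def scripts_containing(html, needle):
    """Return the bodies of <script> tags whose text contains `needle`."""
    return [body for body in SCRIPT_RE.findall(html) if needle in body]


hits = scripts_containing(SAMPLE_HTML, "Reservoir Dogs")
print(len(hits))
```

With the real page, pass `response.text` and a title you can see in the browser.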

This screenshot shows what the page source looks like and the data we need in it:

[image: page source containing the inline JSON]

Since there are a lot of matches, we use a regular expression to isolate the part of the code that contains the list itself:

# https://regex101.com/r/CqzweN/1
# Raw string avoids Python's invalid-escape warning for "\:"
portion_of_script = re.findall(r'"Author:42821":\{(.*)"Article:54591":', str(all_script))

And then we retrieve the list of movies directly:

# https://regex101.com/r/jRgmKA/1
movie_list = re.findall(r'"titleText":"(.*?)"', str(portion_of_script))

Alternatively, we can convert the inline JSON into a Python object with json.loads(<variable_that_stores_json_data>) and then access it as we would a regular dict.
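A minimal sketch of the json.loads route. The script text and its key layout (`items`, `window.__DATA__`) are hypothetical, since the answer does not show the real structure of the page's JSON; only the `titleText` key comes from the regex above.

```python
import json
import re

# Hypothetical inline script; the real page's JSON layout is an assumption here.
SAMPLE_SCRIPT = (
    'window.__DATA__ = {"items": ['
    '{"titleText": "100) Reservoir Dogs"}, '
    '{"titleText": "99) Groundhog Day"}]};'
)

# Grab everything from the first "{" to the last "}" and parse it as JSON.
match = re.search(r"\{.*\}", SAMPLE_SCRIPT, re.DOTALL)
data = json.loads(match.group(0))

# Once parsed, the data is a plain dict/list structure.
titles = [item["titleText"] for item in data["items"]]
print(titles)
```

Compared with a second regex pass, parsing the JSON handles escaped quotes and non-ASCII characters for you, at the cost of needing to know where the JSON object starts and ends.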

Keep in mind that most sites do not like being scraped, and the request might be blocked: by default, the requests library sends a user-agent of python-requests, which identifies the script. An additional step could be to rotate the user-agent, for example switching between PC, mobile, and tablet, as well as between browsers such as Chrome, Firefox, Safari, and Edge.
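One simple way to rotate the user-agent is to pick one at random per request. This is a sketch, not a guarantee against blocking; the strings in `USER_AGENTS` are illustrative examples and `random_headers` is a helper name made up for this example.

```python
import random

# Illustrative user-agent strings (desktop Chrome, desktop Firefox, mobile Safari).
# Replace them with current ones before relying on this.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Mobile/15E148 Safari/604.1",
]


def random_headers():
    """Build a headers dict with a randomly chosen user-agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
print(headers["User-Agent"])
```

The result can be passed straight to requests, e.g. `requests.get(URL, headers=random_headers(), timeout=30)`.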

The full working code:

from bs4 import BeautifulSoup
import requests, re, json, lxml

# Send a browser-like user-agent instead of the default python-requests one
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

html = requests.get("https://www.empireonline.com/movies/features/best-movies-2/", headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
all_script = soup.select("script")

# https://regex101.com/r/CqzweN/1
portion_of_script = re.findall(r'"Author:42821":\{(.*)"Article:54591":', str(all_script))

# https://regex101.com/r/jRgmKA/1
movie_list = re.findall(r'"titleText":"(.*?)"', str(portion_of_script))

print(json.dumps(movie_list, indent=2, ensure_ascii=False))

Example output

["100) Reservoir Dogs","99) Groundhog Day","98) Paddington 2","97) Amelie","96) Brokeback Mountain","95) Donnie Darko","94) Scott Pilgrim Vs. The World","93) Portrait Of A Lady On Fire","92) Léon","91) Logan","90) The Terminator","89) No Country For Old Men","88) Titanic","87) The Exorcist","86) Black Panther","85) Shaun Of The Dead","84) Lost In Translation","83) Thor: Ragnarok","82) The Usual Suspects","81) Psycho","80) L.A. Confidential","79) E.T. – The Extra Terrestrial","78) In The Mood For Love","77) Star Wars: Return Of The Jedi","76) Arrival","75) A Quiet Place","74) Trainspotting","73) Mulholland Drive","72) Rear Window","71) Up","70) Spider-Man: Into The Spider-Verse","69) Inglourious Basterds","68) Lady Bird","67) Singin\\' In The Rain","66) One Flew Over The Cuckoo\\'s Nest",# ...
]