How to get the proper link from a website using Python BeautifulSoup?

2024/10/6 20:33:30

When I try to scrape roster links, I get https://gwsports.com/roster.aspx?path=wpolo, but when I open that link in Chrome it changes to https://gwsports.com/sports/mens-water-polo/roster. I want to scrape the link in its final form, like the second one (https://gwsports.com/sports/mens-water-polo/roster).

pip install -U gazpacho

from gazpacho import get, Soup

url = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s = [link.attrs['href'] for link in links]
print(s)

Answer

This is not an issue with scraping; you're getting the exact URL that's on the page. That URL then redirects to the final URL, which is the one you need.
You can use the requests library to get the final URL:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
                         'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
url = 'https://gwsports.com/roster.aspx?path=wpolo'
r = requests.get(url, allow_redirects=True, headers=headers)
if r.status_code == 200:
    print(r.url)  # URL after redirections
else:
    print('Request failed')
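If you also want to see the intermediate hops rather than only the final address, a requests response keeps the earlier responses in its history attribute. The helper below is a minimal sketch: redirect_chain is a hypothetical name, and the example feeds it a stand-in object built with SimpleNamespace so it runs without network access. In real use you would pass the r returned by requests.get(url, allow_redirects=True).

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return every URL visited, from the first request to the final one.

    Works with any object exposing .history (a list of earlier responses)
    and .url, which is the shape of a requests.Response.
    """
    return [hop.url for hop in response.history] + [response.url]


# Stand-in for a real response, using the two URLs from the question.
fake_response = SimpleNamespace(
    history=[SimpleNamespace(url="https://gwsports.com/roster.aspx?path=wpolo")],
    url="https://gwsports.com/sports/mens-water-polo/roster",
)

for hop in redirect_chain(fake_response):
    print(hop)
```

With no redirects, history is empty and the list contains only the requested URL.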

Applied to your code, that looks like this:

from gazpacho import get, Soup
import requests

def get_final_url(url, root):
    # Note this function assumes url is relative and always prepends root
    # You may want to extend it to detect absolute URLs
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
                             'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
    r = requests.get(root + url, allow_redirects=True, headers=headers)
    if r.status_code == 200:
        return r.url  # URL after redirections
    else:
        raise requests.HTTPError(f'Request failed with status {r.status_code}')

url = 'https://gwsports.com'
root = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s = [get_final_url(link.attrs['href'], root) for link in links]
print(s)

Output

['https://gwsports.com/sports/baseball/roster', 'https://gwsports.com/sports/mens-basketball/roster', 'https://gwsports.com/sports/mens-golf/roster', 'https://gwsports.com/sports/mens-soccer/roster', 'https://gwsports.com/sports/mens-swimming-and-diving/roster', 'https://gwsports.com/sports/mens-cross-country/roster', 'https://gwsports.com/sports/mens-water-polo/roster', 'https://gwsports.com/sports/womens-basketball/roster', 'https://gwsports.com/sports/womens-gymnastics/roster', 'https://gwsports.com/sports/womens-lacrosse/roster', 'https://gwsports.com/sports/womens-rowing/roster', 'https://gwsports.com/sports/womens-soccer/roster', 'https://gwsports.com/sports/softball/roster', 'https://gwsports.com/sports/womens-swimming-and-diving/roster', 'https://gwsports.com/sports/womens-tennis/roster', 'https://gwsports.com/sports/womens-cross-country/roster', 'https://gwsports.com/sports/womens-volleyball/roster']
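
The code comment above mentions extending the function to cope with absolute URLs. One possible way to do that, offered as a sketch rather than as part of the original answer, is to resolve each href with urllib.parse.urljoin from the standard library instead of concatenating strings. absolute_url is a hypothetical helper name.

```python
from urllib.parse import urljoin


def absolute_url(root, href):
    """Resolve href against root; an already-absolute href is returned unchanged."""
    return urljoin(root, href)


root = "https://gwsports.com"

# A root-relative href, as found on the page.
print(absolute_url(root, "/roster.aspx?path=wpolo"))

# An href that is already absolute is not prefixed a second time.
print(absolute_url(root, "https://gwsports.com/roster.aspx?path=wpolo"))
```

Both calls print https://gwsports.com/roster.aspx?path=wpolo, so the result can be passed to requests.get without checking which form the page used.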
