When I try to scrape the roster links, I get https://gwsports.com/roster.aspx?path=wpolo. When I open that link in Chrome, it changes to https://gwsports.com/sports/mens-water-polo/roster. I want to scrape the link in its final form, like the second one (https://gwsports.com/sports/mens-water-polo/roster).
pip install -U gazpacho

from gazpacho import get, Soup

url = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s=[link.attrs['href'] for link in links]
print(s)
This is not an issue with scraping: you're getting the exact URL that's on the page. That URL then redirects to the final URL, which is the one you need.
You can use the requests library to get the final URL:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
                         'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
url = 'https://gwsports.com/roster.aspx?path=wpolo'
r = requests.get(url, allow_redirects=True, headers=headers)
if r.status_code == 200:
    print(r.url)  # URL after redirections
else:
    print('Request failed')
With that, your code becomes:
from gazpacho import get, Soup
import requests


def get_final_url(url, root):
    # Note: the caller prepends root to each href, so the scraped hrefs are
    # assumed to be relative. root itself is not used inside this function.
    # You may want to extend it to detect absolute URLs.
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; '
                             'Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
    r = requests.get(url, allow_redirects=True, headers=headers)
    if r.status_code == 200:
        return r.url  # URL after redirections
    else:
        raise requests.HTTPError


url = 'https://gwsports.com'
root = 'https://gwsports.com'
html = get(url)
soup = Soup(html)
links = soup.find('a', {'href': "roster"}, partial=True)
s = [get_final_url(root + link.attrs['href'], root) for link in links]
print(s)
Output:
['https://gwsports.com/sports/baseball/roster', 'https://gwsports.com/sports/mens-basketball/roster', 'https://gwsports.com/sports/mens-golf/roster', 'https://gwsports.com/sports/mens-soccer/roster', 'https://gwsports.com/sports/mens-swimming-and-diving/roster', 'https://gwsports.com/sports/mens-cross-country/roster', 'https://gwsports.com/sports/mens-water-polo/roster', 'https://gwsports.com/sports/womens-basketball/roster', 'https://gwsports.com/sports/womens-gymnastics/roster', 'https://gwsports.com/sports/womens-lacrosse/roster', 'https://gwsports.com/sports/womens-rowing/roster', 'https://gwsports.com/sports/womens-soccer/roster', 'https://gwsports.com/sports/softball/roster', 'https://gwsports.com/sports/womens-swimming-and-diving/roster', 'https://gwsports.com/sports/womens-tennis/roster', 'https://gwsports.com/sports/womens-cross-country/roster', 'https://gwsports.com/sports/womens-volleyball/roster']
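On the note about absolute URLs: one way to handle both cases is to resolve each href with urljoin from the standard library instead of concatenating root + href by hand. This is only a sketch. The helper name to_absolute and the sample hrefs below are made up for illustration and are not scraped output; the result of to_absolute(...) is what you would then pass to get_final_url.

```python
from urllib.parse import urljoin

ROOT = 'https://gwsports.com'


def to_absolute(href, root=ROOT):
    """Return href as an absolute URL (hypothetical helper)."""
    # urljoin leaves an already-absolute href unchanged and resolves
    # a relative href against root.
    return urljoin(root, href)


# Illustrative hrefs only (assumed shapes, not real scraped values)
hrefs = [
    '/roster.aspx?path=wpolo',                         # relative
    'https://gwsports.com/roster.aspx?path=baseball',  # already absolute
]
absolute = [to_absolute(h) for h in hrefs]
print(absolute)
```

With this in place, the list comprehension would call get_final_url(to_absolute(link.attrs['href']), root) rather than building the URL with string concatenation.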