How to web-scrape all shoes on a Nike page using Python

2024/9/20 12:35:28

I am trying to webscrape all the shoes on https://www.nike.com/w/mens-shoes-nik1zy7ok. How do I scrape all the shoes including the shoes that load as you scroll down the page?

The exact information I want to obtain is inside the div elements with the class "product-card__body" as follows:

<div class="product-card__body " data-el-type="Card"><figure><a class="product-card__link-overlay" href="https://www.nike.com/t/air-force-1-07-mens-shoe-TjqcX1/CJ0952-001">Nike Air Force 1 '07</a><a class="product-card__img-link-overlay" href="https://www.nike.com/t/air-force-1-07-mens-shoe-TjqcX1/CJ0952-001" aria-describedby="Nike Air Force 1 '07" data-el-type="Hero"><div class="image-loader css-zrrhrw product-card__hero-image is--loaded"><picture><source srcset="https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/s12ff321cn2nykxhva9j/air-force-1-07-mens-shoe-TjqcX1.jpg" media="(min-width: 1024px)"><source srcset="https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/s12ff321cn2nykxhva9j/air-force-1-07-mens-shoe-TjqcX1.jpg" media="(max-width: 1023px) and (-webkit-min-device-pixel-ratio: 2), (min-resolution: 192dpi)"><source srcset="https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/s12ff321cn2nykxhva9j/air-force-1-07-mens-shoe-TjqcX1.jpg" media="(max-width: 1023px)"><img src="https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/s12ff321cn2nykxhva9j/air-force-1-07-mens-shoe-TjqcX1.jpg" alt="Nike Air Force 1 '07 Men's Shoe"></picture></div></a><div class="product-card__info"><div class="product_msg_info"><div class="product-card__titles"><div class="product-card__title " id="Nike Air Force 1 '07">Nike Air Force 1 '07</div><div class="product-card__subtitle ">Men's Shoe</div></div></div><div class="product-card__count-wrapper show--all"><div class="product-card__count-item"><button type="button" aria-expanded="false" class="product-card__colorway-btn"><div aria-label="Available in 3 Colors" aria-describedby="Nike Air Force 1 '07" class="product-card__product-count "><span>3 Colors</span></div></button></div></div><div class="product-card__price-wrapper "><div class="product-card__price"><div><div class="product-price css-11s12ax is--current-price" 
data-test="product-price">$90</div></div></div></div></div></figure></div>

Here is the code I am using:

    html_data = requests.get("https://www.nike.com/w/mens-shoes-nik1zy7ok").text
    shoes = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

Right now it only retrieves the shoes that initially load on the page. How do I get the rest of the shoes as well and append that to the shoes variable?

Answer

By examining the API calls made by the website you can find a cryptic URL starting with https://api.nike.com/. This URL is also stored in the INITIAL_REDUX_STATE that you already used to get the first couple of products. So, I simply extend your approach:

    import requests
    import json
    import re

    # your product page
    uri = 'https://www.nike.com/w/mens-shoes-nik1zy7ok'

    base_url = 'https://api.nike.com'
    session = requests.Session()

    def get_lazy_products(stub, products):
        """Get the lazily loaded products."""
        response = session.get(base_url + stub).json()
        next_products = response['pages']['next']
        products += response['objects']
        if next_products:
            get_lazy_products(next_products, products)
        return products

    # find INITIAL_REDUX_STATE
    html_data = session.get(uri).text
    redux = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

    # find the initial products and the api entry point for the recursive loading of additional products
    wall = redux['Wall']
    initial_products = re.sub('anchor=[0-9]+', 'anchor=0', wall['pageData']['next'])

    # find all the products
    products = get_lazy_products(initial_products, [])

    # Optional: filter by id to get a list with unique products
    cloudProductIds = set()
    unique_products = []
    for product in products:
        try:
            if product['id'] not in cloudProductIds:
                cloudProductIds.add(product['id'])
                unique_products.append(product)
        except KeyError:
            print(product)
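The regular-expression step above fails with an AttributeError if the page ever stops embedding INITIAL_REDUX_STATE, because re.search then returns None. A minimal sketch of a more defensive variant follows. The helper name extract_redux_state, the re.DOTALL flag, and the sample HTML string are my own assumptions for illustration; only the key path Wall -> pageData -> next is taken from the code above.

```python
import json
import re

# Same pattern as in the answer, with the dot escaped. re.DOTALL is an
# assumption that lets the match continue if the blob spans several lines.
STATE_PATTERN = re.compile(r'window\.INITIAL_REDUX_STATE=(\{.*?\});', re.DOTALL)


def extract_redux_state(html):
    """Return the parsed state dict, or raise ValueError if it is missing."""
    match = STATE_PATTERN.search(html)
    if match is None:
        raise ValueError("INITIAL_REDUX_STATE not found; the page may have changed")
    return json.loads(match.group(1))


# Made-up miniature page, used only to exercise the helper offline.
sample_html = (
    '<script>window.INITIAL_REDUX_STATE='
    '{"Wall": {"pageData": {"next": "/stub?anchor=24"}}};</script>'
)
state = extract_redux_state(sample_html)

# Reset the anchor to 0 so that paging starts from the first product.
first_stub = re.sub(r'anchor=[0-9]+', 'anchor=0', state['Wall']['pageData']['next'])
print(first_stub)
```

With the real page you would pass session.get(uri).text instead of sample_html.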

The API also returns the total number of products, though this number seems to vary depending on the count parameter in the API's URL.
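Since get_lazy_products calls itself once per page, a very long listing could in principle run into Python's recursion limit. Below is a sketch of an iterative alternative. It assumes the same response shape as in the code above (a 'pages' dict with a 'next' stub and an 'objects' list). The function name collect_products, the fetch parameter, and the max_pages guard are hypothetical names of mine, and the fake pages exist only so the loop can be run without network access.

```python
def collect_products(first_stub, fetch, max_pages=1000):
    """Follow the 'next' stubs in a loop instead of recursing.

    fetch is any callable that maps a stub to the decoded JSON page.
    """
    products = []
    seen_stubs = set()
    stub = first_stub
    # Stop on an empty stub, a repeated stub, or after max_pages requests.
    while stub and stub not in seen_stubs and len(seen_stubs) < max_pages:
        seen_stubs.add(stub)
        page = fetch(stub)
        products.extend(page.get('objects', []))
        stub = page.get('pages', {}).get('next')
    return products


# Fake two-page API for demonstration; the real API is not contacted here.
fake_pages = {
    '/p?anchor=0': {'objects': [{'id': 'a'}, {'id': 'b'}],
                    'pages': {'next': '/p?anchor=2'}},
    '/p?anchor=2': {'objects': [{'id': 'c'}],
                    'pages': {'next': ''}},
}
all_products = collect_products('/p?anchor=0', fake_pages.__getitem__)
print(len(all_products))
```

Against the live site, the fetch argument could be something like lambda stub: session.get(base_url + stub).json(), reusing session and base_url from the answer's code.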

Do you need help parsing or aggregating the results?
