XPATH for Scrapy

2024/10/14 20:22:56

So i am using SCRAPY to scrape off the books of a website.

I have the crawler working and it crawls fine, but when it comes to cleaning the HTML using the select in XPATH it is kinda not working out right. Now since it is a book website, i have almost 131 books on each page and their XPATH comes to be likes this

For example getting the title of the books -

1st Book --- > /html/body/div/div[3]/div/div/div[2]/div/ul/li/a/span
2nd Book --->  /html/body/div/div[3]/div/div/div[2]/div/ul/li[2]/a/span 
3rd book --->  /html/body/div/div[3]/div/div/div[2]/div/ul/li[3]/a/span 

The DIV[] number increases with the book. I am not sure how to get this into a loop, so that it catches all the titles. I have to do this for Images and Author names too, but i think it will be similar. Just need to get this initial one done.

Thanks for your help in advance.

Answer

There are different ways to get this

  1. Best to select multiple nodes is, selecting on the basis of ids or class. e.g:

    sel.xpath("//div[@id='id']")
    
  2. You can select like this

    for i in range(0, upto_num_of_divs):list = sel.xpath("//div[%s]" %i)
    
  3. You can select like this

    for i in range(0, upto_num_of_divs):list = sel.xpath("//div[position > =1 and position() < upto_num_of_divs])
    
https://en.xdnf.cn/q/117915.html

Related Q&A

Kivy Removing elements from a Stack- / GridLayout

I made a pop-up. It is basically some rows of options (max. 5 rows).If I press the + button, there will be a new line of options.If I press the - button the last row should diasappear. Unfortunatelly i…

Bad timing when playing audio files with PyGame

When I play a sound every 0.5 second with PyGame:import pygame, timepygame.mixer.init() s = pygame.mixer.Sound("2.wav")for i in range(8):pygame.mixer.Channel(i).play(s)time.sleep(0.5)it doesn…

Using read_batch_record_features with an Estimator

(Im using tensorflow 1.0 and Python 2.7)Im having trouble getting an Estimator to work with queues. Indeed, if I use the deprecated SKCompat interface with custom data files and a given batch size, the…

How to insert integers into a list without indexing using python?

I am trying to insert values 0 - 9 into a list without indexing. For example if I have the list [4, 6, X, 9, 0, 1, 5, 7] I need to be able to insert the integers 0 - 9 into the placeholder X and test i…

Separate/reposition/translate shapes in image with pillow in python

I need to separate or translate or replace pixels in an image with python so as to all the shapes to share the same distance between each other and the limits of the canvas.background is white, shapes …

django.db.utils.OperationalError: (1045, Access denied for user user@localhost

I cant get my Django project to load my database correctly. It throws this error. Im running MariaDB with Django, and I uninstalled all MySQL I added the user by running:create database foo_db; create …

Paho Python Client with HiveMQ

i am developing a module in python that will allow me to connect my raspberry pi to a version of hivemq hosted on my pc.it connects normally but when i add hivemqs file auth plugin it doesnt seem to wo…

Comparison of value items in a dictionary and counting matches

Im using Python 2.7. Im trying to compare the value items in a dictionary.I have two problems. First is the iteration of values in a dictionary with a length of 1. I always get an error, because python…

how to send cookies inside post request

trying to send Post request with the cookies on my pc from get request #! /usr/bin/python import re #regex import urllib import urllib2 #get request x = urllib2.urlopen("http://www.example.com) #…

Flask-Uploads gives AttributeError?

from flask import Flask from flask.ext.uploads import UploadSet, configure_uploads, IMAGESapp = Flask(__name__)app.config[UPLOADED_PHOTOS_DEST] = /home/kevin photos = UploadSet(photos, IMAGES)configure…