How to download all the href (pdf) inside a class with python beautiful soup?

2024/10/5 19:31:29

I have around 900 pages, and each page contains 10 buttons (each button links to a PDF). I want to download all the PDFs: the program should browse through all the pages and download the PDFs one by one.

My code only searches for links ending in .pdf, but my hrefs do not end in .pdf. The page_no parameter runs from 1 to 900.

https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3

This is the website and below is the link:

BID NO: GEM/2021/B/1804626

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

# If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which are unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

Answer

You only need the href of each of the links you call buttons, prefixed with the appropriate protocol and domain.

The links can be matched with the following selector:

.bid_no > a

That is, anchor (a) tags whose direct parent element has class bid_no.

This should pick up 10 links per page. As you will need a file name for each download, I suggest having a global dict in which you store the link text as keys and the links as values. I would replace the "/" in the link descriptions with "_", since "/" is not valid inside a file name. You simply add to this dict during your loop over the desired number of pages.
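
A minimal sketch of how one such dictionary entry could be built, using only the standard library. The helper name add_link and the href value /showbidDocument/123 are made up for illustration; only the domain and the bid number format come from the question.

```python
from urllib.parse import urljoin

BASE_URL = "https://bidplus.gem.gov.in"

# Global dict: sanitised link text -> absolute URL
pdf_links = {}


def add_link(link_text, href):
    """Store one link, using the link text as a file-name-safe key."""
    # "/" is a path separator on most systems, so swap it for "_"
    key = link_text.strip().replace("/", "_")
    # urljoin prefixes a relative href with the protocol and domain
    pdf_links[key] = urljoin(BASE_URL, href)
    return key


# The href below is a placeholder, not a real path on the site
add_link(" GEM/2021/B/1804626 ", "/showbidDocument/123")
print(pdf_links)
```

Using urljoin rather than plain string concatenation also copes with hrefs that are already absolute.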

An example dictionary entry: the key GEM_2021_B_1804626 (from the bid number GEM/2021/B/1804626) maps to the full URL of that bid's link.


As there are over 800 pages I have chosen to add in an additional termination page count variable called end_number. I don't want to loop to all pages so this allows me an early exit. You can remove this param if so desired.

Next, you need to determine the actual number of pages. For this, you can use the following CSS selector to get the Last pagination link, then extract its data-ci-pagination-page value and convert it to an integer. This becomes num_pages (the number of pages), at which the loop terminates:

.pagination li:last-of-type > a

That looks for an a tag which is a direct child of the last li element, where those li elements have a shared parent with class pagination i.e. the anchor tag in the last li, which is the last page link in the pagination element.
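
Both selectors can be tried out on a small inline HTML fragment before pointing them at the live site. The markup below is an invented stand-in: the real page structure may differ, the href values are placeholders, and the second bid number is made up. It uses the built-in html.parser backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Invented sample markup that mimics the class and attribute names used above
SAMPLE_HTML = """
<div>
  <p class="bid_no">BID NO: <a href="/placeholder/1">GEM/2021/B/1804626</a></p>
  <p class="bid_no">BID NO: <a href="/placeholder/2">GEM/2021/B/1804627</a></p>
  <p class="other"><a href="/ignored">not a bid link</a></p>
</div>
<ul class="pagination">
  <li><a href="#" data-ci-pagination-page="2">2</a></li>
  <li><a href="#" data-ci-pagination-page="836">Last</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Anchors whose direct parent has class "bid_no"
bid_links = [(a.text.strip(), a["href"]) for a in soup.select(".bid_no > a")]

# Anchor inside the last <li> of the pagination element
last_link = soup.select_one(".pagination li:last-of-type > a")
num_pages = int(last_link["data-ci-pagination-page"])

print(bid_links)
print(num_pages)
```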

Once you have all your desired links and file names (the description text for the links) in your dictionary, loop over the key, value pairs and issue a request for each link's content. Write that content out to disk.


TODO:

I would suggest you look at ways of optimizing the final issuing of requests and writing out to disk. For example, you could first issue all requests asynchronously and store the responses in a dictionary, to optimize what would be an I/O-bound process. Then loop over that dictionary, writing to disk, perhaps with a multi-processing approach to optimize for a more CPU-bound process.
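
One way to overlap the network waits is a thread pool from the standard library. This is a substitute for a true async client, chosen here only because it needs no extra dependencies. In the sketch below, fetch_all and its fetch parameter are hypothetical names: fetch is any callable that takes a URL and returns bytes (for example lambda u: requests.get(u).content). A stand-in fetcher is used so the sketch runs without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(links, fetch, max_workers=8):
    """Download every URL in `links` concurrently.

    links: dict mapping file name -> URL
    fetch: callable taking a URL and returning the response body as bytes
    Returns a dict mapping file name -> bytes.
    """
    names = list(links)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so bodies line up with names
        bodies = pool.map(fetch, (links[name] for name in names))
        return dict(zip(names, bodies))


# Stand-in fetcher so the sketch runs without touching the network
def fake_fetch(url):
    return f"content of {url}".encode()


results = fetch_all(
    {"a": "https://example.com/a", "b": "https://example.com/b"},
    fake_fetch,
)
print(results)
```

Holding every response body in memory may be too much for several thousand PDFs; writing each file as soon as its body arrives is a reasonable variation.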

I would additionally consider whether some sort of wait should be introduced between requests, or whether requests should be batched. You could theoretically currently have something like (836 * 10) + 836 requests.
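
A rough sketch of batching with a pause between batches, again using only the standard library. The names batched and download_in_batches, the batch size and the pause length are all assumptions for illustration; fetch is any callable that takes a URL and returns the body.

```python
import time


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_in_batches(urls, fetch, batch_size=10, pause_seconds=1.0):
    """Fetch URLs batch by batch, sleeping between batches."""
    bodies = []
    batches = list(batched(urls, batch_size))
    for index, batch in enumerate(batches):
        bodies.extend(fetch(url) for url in batch)
        # No need to wait after the final batch
        if index < len(batches) - 1:
            time.sleep(pause_seconds)
    return bodies


# Rough request count from the text: one request per page plus ten PDFs per page
pages = 836
total_requests = pages * 10 + pages
print(total_requests)
```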


import requests
from bs4 import BeautifulSoup as bs

end_number = 3
current_page = 1
pdf_links = {}
path = '<your path>'

with requests.Session() as s:

    while True:
        r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
        soup = bs(r.content, 'lxml')

        for i in soup.select('.bid_no > a'):
            pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']
        #print(pdf_links)

        if current_page == 1:
            num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
            print(num_pages)

        if current_page == num_pages or current_page > end_number:
            break

        current_page += 1

    for k, v in pdf_links.items():
        with open(f'{path}/{k}.pdf', 'wb') as f:
            r = s.get(v)
            f.write(r.content)