I have around 900 pages and each page contains 10 buttons (each button links to a PDF). I want to download all the PDFs - the program should browse through all the pages and download the PDFs one by one.
My code only searches for links ending in .pdf, but the hrefs on this site do not contain .pdf. The pages are numbered with page_no (1 to 900).
https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3
This is the website, and below is the text of one of the bid links:
BID NO: GEM/2021/B/1804626
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

# If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
You only need the href associated with each of the links you call buttons, then prefix it with the appropriate protocol + domain.
The links can be matched with the following selector:
.bid_no > a
That is, anchor (a) tags whose direct parent element has the class bid_no.
This should pick up 10 links per page. As you will need a file name for each download, I suggest keeping a global dict in which you store the links as values and the link text as keys. I would replace the "/" in the link descriptions with "_", as shown in the sketch below. You simply add to this dict during your loop over the desired number of pages.
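For a single page, the collection step might look something like this (a minimal sketch using the selector above; the page URL format and the domain prefix are taken from the full script further down, and lxml is assumed to be installed for the parser):

import requests
from bs4 import BeautifulSoup

pdf_links = {}

r = requests.get('https://bidplus.gem.gov.in/bidlists?bidlists&page_no=1')
soup = BeautifulSoup(r.content, 'lxml')

for link in soup.select('.bid_no > a'):
    # key: link text with "/" swapped for "_" so it can be used as a file name
    # value: absolute URL built from the relative href
    pdf_links[link.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + link['href']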
An example of some of the dictionary entries:
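(Illustrative only - the exact keys depend on the link text found on the page; here the bid number from the question is used with "/" replaced by "_", and the value's path is a placeholder rather than a real document href.)

pdf_links = {
    'GEM_2021_B_1804626': 'https://bidplus.gem.gov.in/<href of that bid link>',
    # ...one entry per bid link on each scraped page
}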
As there are over 800 pages, I have added a termination page count variable called end_number. I don't want to loop over all the pages, so this gives me an early exit. You can remove this parameter if so desired.
Next, you need to determine the actual number of pages. For this you can use the following CSS selector to get the Last pagination link, then extract its data-ci-pagination-page value and convert it to an integer. This can then be the num_pages (number of pages) at which to terminate your loop:

.pagination li:last-of-type > a
That looks for an a tag which is a direct child of the last li element, where those li elements have a shared parent with class pagination, i.e. the anchor tag in the last li, which is the last page link in the pagination element.
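In code, that extraction boils down to a single line (assuming soup holds the parsed first page, as in the full script below):

num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])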
Once you have all your desired links and file names (the description text for the links) in your dictionary, loop over the key, value pairs and issue requests for the content, then write that content out to disk.
TODO:
I would suggest you look at ways of optimizing the final issuing of requests and the writing out to disk. For example, you could first issue all requests asynchronously and store the responses in a dictionary, to optimize what would be an I/O-bound process. Then loop over that, writing to disk, perhaps with a multi-processing approach, to optimize what is a more CPU-bound process.
I would additionally consider whether some sort of wait should be introduced between requests, or whether requests should be batched. You could theoretically end up with something like (836 * 10) + 836 requests. A sketch of one possible concurrent approach is given after the full script below.
import requests
from bs4 import BeautifulSoup as bs

end_number = 3
current_page = 1
pdf_links = {}
path = '<your path>'

with requests.Session() as s:
    while True:
        r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
        soup = bs(r.content, 'lxml')

        # collect the bid links on this page; key = link text with "/" -> "_", value = absolute URL
        for i in soup.select('.bid_no > a'):
            pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']
        #print(pdf_links)

        # on the first page, read the total page count from the Last pagination link
        if current_page == 1:
            num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
            print(num_pages)

        if current_page == num_pages or current_page > end_number:
            break
        current_page += 1

    # download each document and write it to disk using the stored key as the file name
    for k, v in pdf_links.items():
        with open(f'{path}/{k}.pdf', 'wb') as f:
            r = s.get(v)
            f.write(r.content)
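As a rough illustration of the concurrency idea mentioned in the TODO (a minimal sketch, not tested against this site: it assumes the pdf_links dict and path from the script above, and uses a standard-library thread pool for the I/O-bound downloads rather than the asyncio + multi-processing split described; the small worker count acts as a crude form of batching):

import concurrent.futures
import requests

def download(item):
    # fetch one document and write it straight to disk
    name, url = item
    content = requests.get(url).content
    with open(f'{path}/{name}.pdf', 'wb') as f:
        f.write(content)
    return name

# a small pool limits how many requests hit the server at once
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    for finished in executor.map(download, pdf_links.items()):
        print(f'saved {finished}.pdf')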