How to navigate through HTML pages that have paging for their content using Python? [closed]

2024/10/6 11:32:13

I want to crawl all the table entries (the table with the columns S/No., Document No., etc.) from the following website and write them to Excel. So far I can only crawl the data from the first page (10 entries). Can anyone help me with Python code that crawls the data from the first page through the last page of this website?

Website: https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId

My Python code:

# Python 2 code (urllib2, ur'' literals, dict.iterkeys)
from bs4 import BeautifulSoup
import requests
import sys
import mechanize
import pprint
import re
import csv
import urllib
import urllib2

browser = mechanize.Browser()
browser.set_handle_robots(False)
url = 'https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId'
response = browser.open(url)
html_doc = response.read()

rows_list = []
table_dict = {}

soup = BeautifulSoup(html_doc)

# The results table is identified only by its layout attributes
table = soup.find("table", attrs={"width": "100%", "border": "0", "cellspacing": "2", "cellpadding": "3", "bgcolor": "#FFFFFF"})
tr_elements = table.find_all("tr", class_=re.compile((ur'(row_even|row_odd|header_subone)')))

table_rows = []

for i in range(0, len(tr_elements)):
    tr_element = tr_elements[i]
    td_elements_in_tr_element = tr_element.find_all("td")
    rows_list.append([])
    for j in range(0, len(td_elements_in_tr_element)):
        td_element = td_elements_in_tr_element[j]
        table_elements_in_td_element = td_element.find_all("table")
        # Skip cells that only wrap a nested table
        if len(table_elements_in_td_element) > 0:
            continue
        rows_list[i].append(td_element.text)
        pprint.pprint(len(table_elements_in_td_element))

pprint.pprint(rows_list)

rows_list.remove([])

for row in rows_list:
    table_dict[row[0]] = {
        #'S/No.' : row[1],
        'Document No.': row[1] + row[2],
        'Tenders and Quotations': row[3] + row[4],
        'Publication Date': row[5],
        'Closing Date': row[6],
        'Status': row[7]
    }

pprint.pprint(table_dict)

# Write one CSV row per table entry, ordered by key
with open('gebiz.csv', 'wb') as csvfile:
    csvwriter = csv.writer(csvfile, dialect='excel')
    for key in sorted(table_dict.iterkeys()):
        csvwriter.writerow([table_dict[key]['Document No.'], table_dict[key]['Tenders and Quotations'], table_dict[key]['Publication Date'], table_dict[key]['Closing Date'], table_dict[key]['Status']])

Any help will be highly appreciated.

Answer

As far as I can see, paging on this site is driven by JavaScript that runs when you click the Go button or the Next Page button. For the Go button you also have to fill in the text box each time. You can use different approaches to work around this:

1) Selenium - Web Browser Automation

2) spynner - Programmatic web browsing module with AJAX support for Python

3) If you are familiar with C#, it also provides a WebBrowser component that lets you click on HTML elements. You can save the HTML content of each page and crawl the saved pages offline later.
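
As a rough illustration of option 1, here is a minimal Python 3 sketch of a paging loop driven by Selenium. It is not tested against the site in the question. The link text "Next Page", the Firefox driver, and the helper names `crawl_pages`, `parse_tender_rows` and `main` are all assumptions made for the example; inspect the real page to find the right locator for its paging control.

```python
import csv
import re

URL = "https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId"


def crawl_pages(driver, parse_rows, next_locator, max_pages=1000):
    """Collect rows from every page by clicking "next" until it is gone.

    driver       -- a Selenium WebDriver (needs page_source and find_elements)
    parse_rows   -- function that turns one page's HTML into a list of rows
    next_locator -- (by, value) pair for the paging control (an assumption)
    """
    all_rows = []
    for _ in range(max_pages):  # hard cap so a misbehaving page cannot loop forever
        all_rows.extend(parse_rows(driver.page_source))
        next_buttons = driver.find_elements(*next_locator)
        if not next_buttons:  # no "next" control found: this was the last page
            break
        next_buttons[0].click()  # triggers the page's own JavaScript handler
    return all_rows


def parse_tender_rows(html):
    """Extract cell texts from the data rows, as in the question's code."""
    from bs4 import BeautifulSoup  # imported lazily; only needed when parsing

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_=re.compile(r"row_even|row_odd")):
        # Skip cells that only wrap a nested table, like the original loop does
        cells = [td.get_text(strip=True) for td in tr.find_all("td") if not td.find("table")]
        if cells:
            rows.append(cells)
    return rows


def main():
    """Hypothetical entry point: open a browser, crawl, write gebiz.csv."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(URL)
        # "Next Page" is a guess at the link text, not taken from the site
        rows = crawl_pages(driver, parse_tender_rows, (By.LINK_TEXT, "Next Page"))
    finally:
        driver.quit()
    with open("gebiz.csv", "w", newline="") as csvfile:
        csv.writer(csvfile, dialect="excel").writerows(rows)
```

Call `main()` to run the crawl. If the table is refreshed by AJAX rather than a full page load, the loop may read the old page again; in that case an explicit wait (for example Selenium's `WebDriverWait`) after the click is probably needed.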
