Unable to scrape the name from the inner page of each result using requests

2024/11/13 9:07:48

I've written a Python script that uses HTTP POST requests to get the search results from a webpage. To populate the results, you have to click on the fields in the sequence shown here. A new page then opens, and this is how the results are populated.

The first page contains ten results, and the following script parses them correctly.

What I want to do now is follow each result to its inner page and parse the Sole Proprietorship Name (English) from there.

website address

Here is what I've tried so far:

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.businessregistration.moc.gov.kh/cambodia-master/service/create.html?targetAppCode=cambodia-master&targetRegisterAppCode=cambodia-br-soleproprietorships&service=registerItemSearch"

payload = {
    'QueryString': '0',
    'SourceAppCode': 'cambodia-br-soleproprietorships',
    'OriginalVersionIdentifier': '',
    '_CBASYNCUPDATE_': 'true',
    '_CBHTMLFRAG_': 'true',
    '_CBNAME_': 'buttonPush'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
    res = s.get(url)
    target_url = res.url.split("&")[0].replace("view.", "update.")
    node = re.findall(r"nodeW\d.+?-Advanced", res.text)[0].strip()
    payload['_VIKEY_'] = re.findall(r"viewInstanceKey:'(.*?)',", res.text)[0].strip()
    payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
    payload[node] = 'N'
    payload['_CBNODE_'] = re.findall(r"Callback\('(.*?)','buttonPush", res.text)[2]
    payload['_CBHTMLFRAGNODEID_'] = re.findall(r"AsyncWrapper(W\d.+?)'", res.text)[0].strip()

    res = s.post(target_url, data=payload)
    soup = BeautifulSoup(res.content, 'html.parser')
    for item in soup.find_all("span", class_="appReceiveFocus")[3:]:
        print(item.text)

How can I parse the Name (English) from each result's inner page using requests?

Answer

This is one way to parse the name from each result's inner page and then the email address from the Addresses tab. I added the get_email() function only to show how content can be parsed from different tabs.

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.businessregistration.moc.gov.kh/cambodia-master/service/create.html?targetAppCode=cambodia-master&targetRegisterAppCode=cambodia-br-soleproprietorships&service=registerItemSearch"
result_url = "https://www.businessregistration.moc.gov.kh/cambodia-master/viewInstance/update.html?id={}"
base_url = "https://www.businessregistration.moc.gov.kh/cambodia-br-soleproprietorships/viewInstance/update.html?id={}"

def get_names(s):
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
    res = s.get(url)
    target_url = result_url.format(res.url.split("id=")[1])
    soup = BeautifulSoup(res.text, "lxml")

    # Seed the payload with every named input on the search form,
    # then override the fields that trigger the search.
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['QueryString'] = 'a'
    payload['SourceAppCode'] = 'cambodia-br-soleproprietorships'
    payload['_CBNAME_'] = 'buttonPush'
    payload['_CBHTMLFRAG_'] = 'true'
    payload['_VIKEY_'] = re.findall(r"viewInstanceKey:'(.*?)',", res.text)[0].strip()
    payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
    payload['_CBNODE_'] = re.findall(r"Callback\('(.*?)','buttonPush", res.text)[-1]
    payload['_CBHTMLFRAGNODEID_'] = re.findall(r"AsyncWrapper(W\d.+?)'", res.text)[0].strip()

    res = s.post(target_url, data=payload)
    soup = BeautifulSoup(res.text, "lxml")

    # The fragment-related keys are not sent with the per-result requests.
    payload.pop('_CBHTMLFRAGNODEID_')
    payload.pop('_CBHTMLFRAG_')
    payload.pop('_CBHTMLFRAGID_')

    for item in soup.select("a[class*='ItemBox-resultLeft-viewMenu']"):
        # Open the inner page of this result.
        payload['_CBNAME_'] = 'invokeMenuCb'
        payload['_CBVALUE_'] = ''
        payload['_CBNODE_'] = item['id'].replace('node', '')

        res = s.post(target_url, data=payload)
        soup = BeautifulSoup(res.text, 'lxml')
        address_url = base_url.format(res.url.split("id=")[1])

        # Prepare the payload that selects the Addresses tab.
        node_id = re.findall(r"taba(.*)_", soup.select_one("a[aria-label='Addresses']")['id'])[0]
        payload['_CBNODE_'] = node_id
        payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
        payload['_CBNAME_'] = 'tabSelect'
        payload['_CBVALUE_'] = '1'

        eng_name = soup.select_one(".appCompanyName + .appAttrValue").get_text()
        yield from get_email(s, eng_name, address_url, payload)

def get_email(s, eng_name, url, payload):
    res = s.post(url, data=payload)
    soup = BeautifulSoup(res.text, 'lxml')
    # Newer soupsieve releases prefer ':-soup-contains' over the deprecated ':contains'.
    email = soup.select_one(".EntityEmailAddresses:contains('Email') .appAttrValue").get_text()
    yield eng_name, email

if __name__ == '__main__':
    with requests.Session() as s:
        for item in get_names(s):
            print(item)

The output looks like this:

('AMY GEMS', '[email protected]')
('AHARATHAN LIN LIANJIN FOOD FLAVOR', '[email protected]')
('AMETHYST DIAMOND KTV', '[email protected]')
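
The script above indexes straight into re.findall() results and splits the URL on "id=", so a changed page layout would surface as a bare IndexError. As a rough sketch only, the same token extraction could be wrapped in a small helper that reports which pattern failed. SAMPLE_HTML, first_match, instance_id and the example.invalid URL are all made up for illustration; the real page markup may differ from this fragment.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical fragment shaped like the text the regexes in the script target.
# It is NOT copied from the real site.
SAMPLE_HTML = """
<script>
  init({viewInstanceKey:'abc123', guid:42, other:'x'});
  Callback('W10','buttonPush');
  Callback('W27','buttonPush');
</script>
"""


def first_match(pattern, text, index=0):
    """Return the index-th capture of pattern in text, or raise a clear error."""
    matches = re.findall(pattern, text)
    try:
        return matches[index].strip()
    except IndexError:
        raise ValueError(
            f"pattern {pattern!r} matched {len(matches)} time(s); "
            f"index {index} is unavailable"
        ) from None


def instance_id(url):
    """Read only the `id` query parameter, ignoring any parameters after it."""
    values = parse_qs(urlparse(url).query).get("id")
    if not values:
        raise ValueError(f"no 'id' parameter in {url!r}")
    return values[0]


# Same patterns as in the script, applied to the made-up fragment.
tokens = {
    "_VIKEY_": first_match(r"viewInstanceKey:'(.*?)',", SAMPLE_HTML),
    "_CBHTMLFRAGID_": first_match(r"guid:(.*?),", SAMPLE_HTML),
    "_CBNODE_": first_match(r"Callback\('(.*?)','buttonPush", SAMPLE_HTML, index=-1),
}
print(tokens)
print(instance_id("https://example.invalid/viewInstance/view.html?id=XYZ&t=1"))
```

Unlike res.url.split("id=")[1], which keeps everything after "id=", parse_qs returns just the value of that one parameter, which may matter if the redirect URL ever carries extra query parameters.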