from bs4 import BeautifulSoup, SoupStrainer
from urllib.request import urlopen
import pandas as pd
import numpy as np
import re
import csv
import ssl
import json
from googlesearch import search
from queue import Queue
links = []
menu = []
filtered_menu = []


def contains(substring, string):
    if substring.lower() in string.lower():
        return True
    else:
        return False


for website in search("mr puffs", tld="com", num=1, stop=1, country="canada", pause=4):
    links.append(website)

soup = BeautifulSoup(urlopen(links.pop(0)), features="html.parser")
menu = soup.find_all('a', href=True)

for string in menu:
    if contains("contact", string):
        filtered_menu.append(string)

print(filtered_menu)
I am creating a web scraper that will extract contact information from sites. To do that, I first need to reach the contact page of each website. Using the googlesearch library, the code searches for a keyword and puts the results (up to a certain limit) in a list. For simplicity, this code only takes the first link. From this link, I create a BeautifulSoup object and extract all the other links on the page (because the contact information is usually not found on the homepage). I put these links in a list called menu.
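One detail that may matter when following the extracted links: href values are often relative (for example "/contact-us"), so they usually need to be resolved against the page URL before they can be opened. Below is a minimal sketch using the standard library's urljoin. The base URL and the href values are hypothetical stand-ins, and absolute_links is an illustrative helper name, not part of the original code.

```python
from urllib.parse import urljoin

# Hypothetical base URL and href values, standing in for what a real page returns.
base_url = "https://www.smallBusiness.com/"
hrefs = ["/our-services", "contact-us", "https://www.smallBusiness.com/contact", "#top"]


def absolute_links(base, raw_hrefs):
    """Resolve each href against the base URL, skipping in-page anchors."""
    resolved = []
    for href in raw_hrefs:
        if href.startswith("#"):
            continue  # fragment-only links point back at the same page
        resolved.append(urljoin(base, href))
    return resolved


menu_urls = absolute_links(base_url, hrefs)
print(menu_urls)
```

Skipping fragment-only links is a design choice for this sketch; a real scraper might also want to drop mailto: and javascript: links.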
Now I want to filter menu so that it keeps only the links that contain "contact". For example, "www.smallBusiness.com/our-services" would be dropped from the new list, while "www.smallBusiness.com/contact" and "www.smallBusiness.com/contact-us" would stay.
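The intended filter can be sketched on plain strings, which is the case where the substring check works as expected. This assumes the links are already available as str values; the URLs are the examples from the description above.

```python
def contains(substring, text):
    """Case-insensitive substring check on plain strings."""
    return substring.lower() in text.lower()


# Example URLs taken from the description above.
urls = [
    "www.smallBusiness.com/our-services",
    "www.smallBusiness.com/contact",
    "www.smallBusiness.com/contact-us",
]

# Keep only the URLs that mention "contact".
contact_urls = [url for url in urls if contains("contact", url)]
print(contact_urls)
```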
I defined a function that checks whether a substring is in a string. However, I get the following exception:
TypeError: 'NoneType' object is not callable.
I've also tried a regex with re.search, but it raises an error saying that it expected a string or bytes-like object.
I think it's because the elements returned by find_all are not strings. They are probably some other type that I can't find in the docs. If so, how do I convert them into strings?
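A small experiment on inline markup can help confirm that suspicion. The sketch below assumes bs4 is installed and uses hypothetical HTML in place of the downloaded page. It shows that find_all yields Tag objects, that an unknown attribute such as .lower is treated as a child-tag lookup and comes back as None (which would explain the "'NoneType' object is not callable" error), and that the href attribute itself is a plain string that can be filtered.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<nav>
  <a href="/our-services">Our Services</a>
  <a href="/contact-us">Contact Us</a>
  <a href="/Contact">Get in touch</a>
</nav>
"""

soup = BeautifulSoup(html, features="html.parser")
anchors = soup.find_all("a", href=True)

# Each element is a bs4 Tag, not a str.
print(type(anchors[0]).__name__)

# Unknown attributes on a Tag are looked up as child tags, so .lower is None here.
print(anchors[0].lower)

# The href attribute is a plain string, so string methods work on it.
hrefs = [a["href"] for a in anchors]
contact_hrefs = [h for h in hrefs if "contact" in h.lower()]
print(contact_hrefs)
```

If the whole tag is needed as text rather than just the link, str(a) returns its markup and a.get_text() returns the visible label; either result is an ordinary string that re.search accepts.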
As requested in the answer below, here's what printing the menu list gives:
From here, I just want to extract the highlighted links: