For loop for web scraping in Python

2024/10/6 0:26:27

I have a small project that scrapes Google search results for a list of keywords. I built a nested for loop to scrape the results. The problem is that the outer loop over the keywords does not behave as I intended, which is to scrape the data from each search result. The output contains only the result for the last keyword; the results for the first two keywords are missing.

Here is the code:

browser = webdriver.Chrome(r"C:\...\chromedriver.exe")
df = pd.DataFrame(columns = ['ceo', 'value'])
baseUrl = 'https://www.google.com/search?q='
html = browser.page_source
soup = BeautifulSoup(html)
ceo_list = ["Bill Gates", "Elon Musk", "Warren Buffet"]
values =[]
for ceo in ceo_list:
    browser.get(baseUrl + ceo)
    r = soup.select('div.g.rhsvw.kno-kp.mnr-c.g-blk')
    df = pd.DataFrame()
    for i in r:
        value = i.select_one('div.Z1hOCe').text
        ceo = i.select_one('.kno-ecr-pt.PZPZlf.gsmt.i8lZMc').text
        values = [ceo, value]
        s = pd.Series(values)
        df = df.append(s,ignore_index=True)
print(df)

The output:

              0                                                  1
0  Warren Buffet  Born: October 28, 1955 (age 64 years), Seattle...

The output I am expecting looks like this:

              0                                                  1
0  Bill Gates      Born:..........
1  Elon Musk       Born:...........
2  Warren Buffett  Born: August 30, 1930 (age 89 years), Omaha, N...

Any suggestions or comments are welcome here.

Answer

Declare df = pd.DataFrame() outside the for loop.

Because it is currently defined inside the loop, a new, empty DataFrame is created for every keyword in your list and the previous one is replaced. That is why you only get the result for the last keyword.
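
The effect can be reproduced without a browser or pandas. The following is a minimal sketch that uses a plain list in place of the DataFrame; fake_scrape and both function names are made up for illustration and are not part of the question's code.

```python
# Hypothetical stand-in for scraping: maps a keyword to a placeholder result.
def fake_scrape(keyword):
    return "result for " + keyword


def collect_reset_inside(keywords):
    for keyword in keywords:
        rows = []  # re-created on every pass, so earlier rows are discarded
        rows.append((keyword, fake_scrape(keyword)))
    return rows


def collect_created_outside(keywords):
    rows = []  # created once, so rows from every pass accumulate
    for keyword in keywords:
        rows.append((keyword, fake_scrape(keyword)))
    return rows


keywords = ["Bill Gates", "Elon Musk", "Warren Buffet"]
print(len(collect_reset_inside(keywords)))     # 1
print(len(collect_created_outside(keywords)))  # 3
```

The first function keeps only the last keyword's row, which mirrors the single-row output in the question.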

Try this:

browser = webdriver.Chrome(r"C:\...\chromedriver.exe")
df = pd.DataFrame(columns = ['ceo', 'value'])
baseUrl = 'https://www.google.com/search?q='
html = browser.page_source
soup = BeautifulSoup(html)
ceo_list = ["Bill Gates", "Elon Musk", "Warren Buffet"]
df = pd.DataFrame()
for ceo in ceo_list:
    browser.get(baseUrl + ceo)
    r = soup.select('div.g.rhsvw.kno-kp.mnr-c.g-blk')
    for i in r:
        value = i.select_one('div.Z1hOCe').text
        ceo = i.select_one('.kno-ecr-pt.PZPZlf.gsmt.i8lZMc').text
        s = pd.Series([ceo, value])
        df = df.append(s,ignore_index=True)
print(df)
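
On newer pandas versions the code above needs one more change, because DataFrame.append was removed in pandas 2.0. The sketch below is one possible variant, not the original answer's code: it collects rows in a plain list and builds the DataFrame once at the end. It also re-reads browser.page_source after each browser.get call, so every keyword is parsed from its own page, and it URL-encodes the keyword. The function name scrape_keywords and the constant names are made up for illustration; the selectors are copied from the question and may no longer match Google's markup.

```python
from urllib.parse import quote_plus

import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://www.google.com/search?q="

# Selectors copied from the question. Google's markup changes often,
# so treat them as assumptions that may no longer match anything.
PANEL_SELECTOR = "div.g.rhsvw.kno-kp.mnr-c.g-blk"
VALUE_SELECTOR = "div.Z1hOCe"
NAME_SELECTOR = ".kno-ecr-pt.PZPZlf.gsmt.i8lZMc"


def scrape_keywords(browser, keywords):
    """Return one DataFrame with a row per matching panel.

    `browser` is any object with .get(url) and .page_source,
    for example a Selenium WebDriver.
    """
    rows = []  # created once, outside the loop
    for keyword in keywords:
        browser.get(BASE_URL + quote_plus(keyword))
        # Parse the page that was just loaded, not one read before the loop.
        soup = BeautifulSoup(browser.page_source, "html.parser")
        for panel in soup.select(PANEL_SELECTOR):
            name = panel.select_one(NAME_SELECTOR)
            value = panel.select_one(VALUE_SELECTOR)
            if name is None or value is None:
                continue  # skip panels that lack either element
            rows.append({"ceo": name.get_text(), "value": value.get_text()})
    # Build the frame once from all collected rows.
    return pd.DataFrame(rows, columns=["ceo", "value"])
```

With the Selenium driver from the question, this would be called as df = scrape_keywords(browser, ceo_list).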
