Question 1

I've written a script in Python with BeautifulSoup to extract the titles of books that show up when ISBN numbers are entered into the Amazon search box. I read those ISBN numbers from an Excel file named amazon.xlsx. When I run the following script, it parses the titles correctly and writes them back to the Excel file as intended.

The ISBN numbers are entered on the Amazon search page, which then populates the results.

    import requests
    from bs4 import BeautifulSoup
    from openpyxl import load_workbook

    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    def get_info(num):
        # Search for the ISBN and follow the first result, if there is one.
        params = {'url': 'search-alias=aps', 'field-keywords': num}
        res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
        soup = BeautifulSoup(res.text, "lxml")
        itemlink = soup.select_one("a.s-access-detail-page")
        if itemlink:
            get_data(itemlink['href'])

    def get_data(link):
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        try:
            itmtitle = soup.select_one("#productTitle").get_text(strip=True)
        except AttributeError:
            itmtitle = "N/A"
        print(itmtitle)
        # Relies on the loop variable `row` set in the __main__ block below.
        ws.cell(row=row, column=2).value = itmtitle
        wb.save("amazon.xlsx")

    if __name__ == '__main__':
        for row in range(2, ws.max_row + 1):
            if ws.cell(row=row, column=1).value is None:
                break
            val = ws["A" + str(row)].value
            get_info(val)

However, when I try to do the same using multiprocessing I get the following error:

ws.cell(row=row, column=2).value = itmtitle
NameError: name 'row' is not defined

For multiprocessing, this is what I changed in my script:

    from multiprocessing import Pool

    if __name__ == '__main__':
        isbnlist = []
        for row in range(2, ws.max_row + 1):
            if ws.cell(row=row,column=1).value==None:
                break
            val = ws["A" + str(row)].value
            isbnlist.append(val)

        with Pool(10) as p:
            p.map(get_info,isbnlist)
            p.terminate()
            p.join()

A few of the ISBNs I've tried:

9781584806844
9780917360664
9780134715308
9781285858265
9780986615108
9780393646399
9780134612966
9781285857589
9781453385982
9780134683461

How can I get rid of that error and get the desired results using multiprocessing?

Answer

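Referencing the global variable row in get_data() cannot work, for two reasons:

1. row is a module-level global, and each worker in the multiprocessing Pool is a separate Python process with its own copy of the module's globals; nothing is shared with the parent. With the spawn start method (the default on Windows), each worker re-imports the script without running the if __name__ == '__main__': block, so row is never defined there, hence the NameError.
2. Even if the globals were shared, you build the entire ISBN list before any call to get_info(), so by then the loop has finished and row would always hold its final value (ws.max_row, or the row where the loop hit break) rather than the row of the ISBN being processed.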
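
The second point can be checked without any multiprocessing at all. The following is only a minimal sketch: max_row is a stand-in for ws.max_row, and the placeholder strings are made up for illustration.

```python
max_row = 5  # hypothetical stand-in for ws.max_row

isbnlist = []
for row in range(2, max_row + 1):
    # Queue one placeholder item per spreadsheet row (rows 2..5).
    isbnlist.append(f"isbn-from-row-{row}")

# The loop is finished before any work would be dispatched, so `row`
# now holds only its final value, whichever item is processed later.
print(row)            # 5
print(len(isbnlist))  # 4
```

Every queued item would therefore see the same row, so the row number has to travel with the item itself.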

So you would need to pass the row numbers along with the ISBNs in the iterable given as the second argument of p.map(). But even then, writing to and saving the spreadsheet from multiple processes is a bad idea because of Windows file locking, race conditions, and so on. You're better off using multiprocessing only to build the list of titles, and then writing them out once from the parent process when that's done, as in the following:
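
For illustration, here is a rough sketch of the first idea, passing each row number together with its ISBN. FAKE_TITLES, fetch_title and build_tasks are hypothetical names, and the dictionary lookup only stands in for the real Amazon request so that the sketch runs offline.

```python
from multiprocessing import Pool

# Hypothetical offline stand-in for the real network lookup.
FAKE_TITLES = {"111": "Book A", "222": "Book B"}


def fetch_title(task):
    """Take a (row, isbn) pair and return (row, title); no globals needed."""
    row, isbn = task
    return row, FAKE_TITLES.get(isbn, "N/A")


def build_tasks(isbns, first_row=2):
    """Pair each ISBN with the spreadsheet row it was read from."""
    return list(enumerate(isbns, start=first_row))


if __name__ == "__main__":
    tasks = build_tasks(["111", "222", "333"])
    with Pool(2) as pool:
        results = pool.map(fetch_title, tasks)
    # Only the parent process would touch the workbook here, saving once.
    for row, title in results:
        print(row, title)
```

Because each worker receives and returns its own row number, nothing depends on shared state. Since map() returns results in input order, the row can also be dropped and recomputed from the position in the list.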

    import requests
    from bs4 import BeautifulSoup
    from openpyxl import load_workbook
    from multiprocessing import Pool


    def get_info(isbn):
        # Search for the ISBN and follow the first result.
        # Returns None when the search page has no matching link.
        params = {'url': 'search-alias=aps', 'field-keywords': isbn}
        res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
        soup = BeautifulSoup(res.text, "lxml")
        itemlink = soup.select_one("a.s-access-detail-page")
        if itemlink:
            return get_data(itemlink['href'])


    def get_data(link):
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        try:
            itmtitle = soup.select_one("#productTitle").get_text(strip=True)
        except AttributeError:
            itmtitle = "N/A"
        return itmtitle


    def main():
        wb = load_workbook('amazon.xlsx')
        ws = wb['content']

        # Collect ISBNs from column A, stopping at the first empty cell.
        isbnlist = []
        for row in range(2, ws.max_row + 1):
            val = ws.cell(row=row, column=1).value
            if val is None:
                break
            isbnlist.append(val)

        # Only the scraping runs in the workers; map() returns the titles
        # in the same order as isbnlist. Leaving the with block shuts the
        # pool down, so no explicit terminate()/join() is needed.
        with Pool(10) as p:
            titles = p.map(get_info, isbnlist)

        # Write every title from the parent process, then save once.
        # Iterating over titles avoids an IndexError if the loop above
        # stopped before ws.max_row.
        for row, title in enumerate(titles, start=2):
            ws.cell(row=row, column=2).value = title
        wb.save("amazon.xlsx")


    if __name__ == '__main__':
        main()

Script throws an error when it is made to run using multiprocessing
