Script throws an error when it is made to run using multiprocessing

2024/9/22 14:27:15

I've written a script in Python with BeautifulSoup to extract the titles of books that come up when their ISBN numbers are entered in the Amazon search box. The ISBN numbers come from an Excel file named amazon.xlsx. When I run the following script, it parses the titles correctly and writes them back to the Excel file as intended.

The link where I put isbn numbers to populate the results.

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook

wb = load_workbook('amazon.xlsx')
ws = wb['content']

def get_info(num):
    params = {
        'url': 'search-alias=aps',
        'field-keywords': num
    }
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?",params=params)
    soup = BeautifulSoup(res.text,"lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        get_data(itemlink['href'])

def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError: itmtitle = "N\A"

    print(itmtitle)
    ws.cell(row=row, column=2).value = itmtitle
    wb.save("amazon.xlsx")

if __name__ == '__main__':
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row,column=1).value==None:
            break
        val = ws["A" + str(row)].value
        get_info(val)

However, when I try to do the same using multiprocessing I get the following error:

ws.cell(row=row, column=2).value = itmtitle
NameError: name 'row' is not defined

The changes I made to the script for multiprocessing are:

from multiprocessing import Pool

if __name__ == '__main__':
    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row,column=1).value==None:
            break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    with Pool(10) as p:
        p.map(get_info,isbnlist)
        p.terminate()
        p.join()

A few of the ISBNs I've tried:

9781584806844
9780917360664
9780134715308
9781285858265
9780986615108
9780393646399
9780134612966
9781285857589
9781453385982
9780134683461

How can I get rid of that error and get the desired results using multiprocessing?

Answer

It does not make sense to reference the global variable row in get_data(), because

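  1. It's a global and will not be shared between the workers in the multiprocessing Pool, because they are separate Python processes, not threads, and do not share globals. With the spawn start method (the default on Windows), each worker re-imports the script, and row is never assigned there because it is only set inside the if __name__ == '__main__': block; that is where the NameError comes from.

  2. Even if they did, because you're building the entire ISBN list before executing get_info(), the value of row would always be the last row the loop visited (at most ws.max_row), because the loop has already completed.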
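The second point can be checked without any multiprocessing at all. The following is a minimal sketch in which last_row is a hypothetical stand-in for ws.max_row; it only shows that a loop variable keeps the value from its final iteration once the loop is done, so any code that reads it later sees that one value for every item.

```python
last_row = 6  # hypothetical stand-in for ws.max_row

# Build the full list first, as the question's script does with isbnlist.
isbn_rows = []
for row in range(2, last_row + 1):
    isbn_rows.append(row)

# The loop has finished, so `row` is stuck at the last value it took.
print(isbn_rows)  # [2, 3, 4, 5, 6]
print(row)        # 6
```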

So you would need to provide the row values as part of the data passed to the second argument of p.map(). But even if you were to do that, writing to and saving the spreadsheet from multiple processes is a bad idea due to Windows file locking, race conditions, etc. You're better off just building the list of titles with multiprocessing, and then writing them out once when that's done, as in the following:

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook
from multiprocessing import Pool


def get_info(isbn):
    params = {'url': 'search-alias=aps', 'field-keywords': isbn}
    res = requests.get("https://www.amazon.com/s/ref=nb_sb_noss?", params=params)
    soup = BeautifulSoup(res.text, "lxml")
    itemlink = soup.select_one("a.s-access-detail-page")
    if itemlink:
        return get_data(itemlink['href'])
    # Falls through and returns None when no result link is found.


def get_data(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    try:
        itmtitle = soup.select_one("#productTitle").get_text(strip=True)
    except AttributeError:
        itmtitle = "N/A"
    return itmtitle


def main():
    wb = load_workbook('amazon.xlsx')
    ws = wb['content']

    # Collect ISBNs from column A, stopping at the first empty cell.
    isbnlist = []
    for row in range(2, ws.max_row + 1):
        if ws.cell(row=row, column=1).value is None:
            break
        val = ws["A" + str(row)].value
        isbnlist.append(val)

    # map() blocks until every title is back and preserves input order;
    # leaving the with block shuts the pool down.
    with Pool(10) as p:
        titles = p.map(get_info, isbnlist)

    # Write from the parent process only; titles[0] belongs to row 2.
    for row, title in enumerate(titles, start=2):
        ws.cell(row=row, column=2).value = title
    wb.save("amazon.xlsx")


if __name__ == '__main__':
    main()
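For completeness, the other option mentioned above, passing the row along with each ISBN through p.map(), might look roughly like the sketch below. It is an illustration under stated assumptions, not part of the original answer: FAKE_TITLES, get_info_for_row and build_tasks are hypothetical names, and the dictionary lookup stands in for the real Amazon request so the example runs offline. The workbook is still left to the parent process.

```python
from multiprocessing import Pool

# Hypothetical stand-in for the real Amazon lookup, so the sketch runs offline.
FAKE_TITLES = {
    "9781584806844": "Title A",
    "9780917360664": "Title B",
}


def get_info_for_row(task):
    """Take a (row, isbn) pair and return a (row, title) pair."""
    row, isbn = task
    title = FAKE_TITLES.get(isbn, "N/A")  # replace with the real request
    return row, title


def build_tasks(isbns, first_row=2):
    """Pair each ISBN with the spreadsheet row it came from."""
    return list(enumerate(isbns, start=first_row))


if __name__ == "__main__":
    tasks = build_tasks(["9781584806844", "9780917360664", "0000000000000"])
    with Pool(2) as p:
        results = p.map(get_info_for_row, tasks)
    # Only the parent process handles the results (and, in the real
    # script, would write them to the workbook and save it once).
    for row, title in results:
        print(row, title)
```

Because each result carries its own row number, the write-back step no longer depends on the order of the results or on any global variable.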