python - multiprocessing module

2024/11/15 12:51:36

Here's what I am trying to accomplish -

  1. I have about a million files which I need to parse & append the parsed content to a single file.
  2. A single process takes ages, so that option is out.
  3. Threads are out too: because of the GIL, Python threads essentially amount to running a single process for this kind of work.
  4. Hence the multiprocessing module, i.e. spawning 4 sub-processes to utilize all that raw core power :)

So far so good. Now I need a shared object that all the sub-processes can access, so I am using a Queue from the multiprocessing module. All the sub-processes also need to write their output to a single file, which is probably a place to use Locks. When I run this setup I get no error (so the parent process seems fine); it just stalls. When I press Ctrl-C I see a traceback, one for each sub-process, and nothing has been written to the output file. Here's the code (note that everything runs fine without multiple processes) -

import os
import glob
from multiprocessing import Process, Queue, Pool

data_file = open('out.txt', 'w+')

def worker(task_queue):
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))
        if data:
            data_file.write(repr(data)+'\n')
    return

def main():
    task_queue = Queue()
    for file in glob.glob('*.csv'):
        task_queue.put(file)
    task_queue.put('STOP') # so that worker processes know when to stop

    # this is the block of code that needs correction.
    if multi_process:
        # One way to spawn 4 processes
        # pool = Pool(processes=4) #Start worker processes
        # res  = pool.apply_async(worker, [task_queue, data_file])

        # But I chose to do it like this for now.
        for i in range(4):
            proc = Process(target=worker, args=[task_queue])
            proc.start()
    else: # single process mode is working fine!
        worker(task_queue)
    data_file.close()
    return

What am I doing wrong? I also tried passing the open file object to each process at spawn time, e.g. Process(target=worker, args=[task_queue, data_file]), but this did not change anything. I feel the sub-processes are not able to write to the file for some reason. Either the file object is not getting replicated at spawn time, or there is some other quirk... Anybody got an idea?
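One likely reason for the stall, offered as a sketch rather than a definitive diagnosis: the queue holds a single 'STOP', so only one of the four workers ever receives it and the other three block in get() forever. The version below puts one sentinel per worker and sends results back to the parent, which is the only process that writes the file, so no Lock is needed. The names parse_file, run, and result_queue are hypothetical, and parse_file is only a stand-in for mine_imdb_page.

```python
import os
import tempfile
from multiprocessing import Process, Queue

STOP = "STOP"  # sentinel value, as in the question


def parse_file(path):
    """Hypothetical stand-in for mine_imdb_page(): returns (name, line count)."""
    with open(path) as fh:
        return (os.path.basename(path), sum(1 for _ in fh))


def worker(task_queue, result_queue):
    # Pull paths until this worker receives its own STOP sentinel.
    for path in iter(task_queue.get, STOP):
        result_queue.put(parse_file(path))
    result_queue.put(STOP)  # tell the parent that this worker is done


def run(paths, out_path, n_workers=4):
    task_queue, result_queue = Queue(), Queue()
    for path in paths:
        task_queue.put(path)
    for _ in range(n_workers):
        task_queue.put(STOP)  # one sentinel per worker

    procs = [Process(target=worker, args=(task_queue, result_queue))
             for _ in range(n_workers)]
    for proc in procs:
        proc.start()

    # Only the parent touches the output file.
    finished = 0
    with open(out_path, "w") as out:
        while finished < n_workers:
            item = result_queue.get()
            if item == STOP:
                finished += 1
            else:
                out.write(repr(item) + "\n")

    # Join only after the result queue has been drained.
    for proc in procs:
        proc.join()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(8):
            p = os.path.join(tmp, f"f{i}.csv")
            with open(p, "w") as fh:
                fh.write("a,b\n" * (i + 1))
            paths.append(p)
        out_path = os.path.join(tmp, "out.txt")
        run(paths, out_path)
        with open(out_path) as fh:
            print(len(fh.readlines()), "records written")
```

The parent drains result_queue before calling join(), because joining a process that still has items buffered for a queue can deadlock.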

EXTRA: Also Is there any way to keep a persistent mysql_connection open & pass it across to the sub_processes? So I open a mysql connection in my parent process & the open connection should be accessible to all my sub-processes. Basically this is the equivalent of a shared_memory in python. Any ideas here?

Answer

Although the discussion with Eric was fruitful, I later found a better way of doing this. The multiprocessing module has a class called Pool which is perfect for my needs.

It sizes itself to the number of cores my system has, i.e. by default only as many worker processes are spawned as there are cores. Of course this is customizable. So here's the code. Might help someone later -

import glob
import os
from multiprocessing import Pool

def main():
    po = Pool()
    for file in glob.glob('*.csv'):
        filepath = os.path.join(DATA_DIR, file)
        po.apply_async(mine_page, (filepath,), callback=save_data)
    po.close()
    po.join()
    file_ptr.close()

def mine_page(filepath):
    # do whatever it is that you want to do in a separate process.
    return data

def save_data(data):
    # data is an object. Store it in a file, mysql or...
    return
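For anyone who wants to run the pattern end to end, here is a minimal self-contained sketch of it. The parser body, the report_error helper, and the temporary files are assumptions made for illustration; only Pool, apply_async, close and join come from the code above. The error_callback argument is added because, without it, an exception raised inside a worker is not reported unless the result's get() method is called.

```python
import os
import tempfile
from multiprocessing import Pool


def mine_page(filepath):
    """Hypothetical parser: returns (file name, size in bytes)."""
    return (os.path.basename(filepath), os.path.getsize(filepath))


def report_error(exc):
    # Called with the exception if mine_page() raises in a worker.
    print("worker failed:", exc)


def main(paths, out_path):
    with open(out_path, "w") as out, Pool() as pool:

        def save_data(data):
            # Receives the return value of one mine_page() call.
            out.write(repr(data) + "\n")

        for path in paths:
            pool.apply_async(mine_page, (path,),
                             callback=save_data,
                             error_callback=report_error)
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for all tasks and callbacks to finish


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(5):
            p = os.path.join(tmp, f"page{i}.csv")
            with open(p, "w") as fh:
                fh.write("x" * (i + 1))
            paths.append(p)
        out_path = os.path.join(tmp, "out.txt")
        main(paths, out_path)
        with open(out_path) as fh:
            print(fh.read())
```

The order of lines in the output file follows the order in which tasks finish, which is not necessarily the order in which they were submitted.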

I am still going through this huge module. I am not sure whether save_data() is executed by the parent process or by the spawned child processes. If it is the child that does the saving, it might lead to concurrency issues in some situations. If anyone has more experience with this module, I would appreciate more knowledge here...
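On that open question: the multiprocessing documentation describes apply_async callbacks as running on a result-handling thread of the process that created the pool, which is why it asks that callbacks return quickly. A small check along these lines can confirm it on a given setup. The names work, record, and callback_pids are hypothetical.

```python
import os
from multiprocessing import Pool

callback_pids = []  # PIDs of the process in which each callback ran


def work(n):
    # Runs in a worker process; returns the worker PID with the result.
    return (os.getpid(), n * n)


def record(result):
    # Note which process is executing the callback.
    callback_pids.append(os.getpid())


if __name__ == "__main__":
    parent_pid = os.getpid()
    with Pool(2) as pool:
        for n in range(4):
            pool.apply_async(work, (n,), callback=record)
        pool.close()
        pool.join()
    # Expected to be True if every callback ran in the parent.
    print(all(pid == parent_pid for pid in callback_pids))
```

If the check prints True, writes made from the callback are serialized in the parent, so the concurrency concern would apply only to work done inside the worker function itself.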

https://en.xdnf.cn/q/71885.html
