Occasional deadlock in multiprocessing.Pool

2024/10/12 11:14:36

I have N independent tasks that are executed in a multiprocessing.Pool of size os.cpu_count() (8 in my case), with maxtasksperchild=1 (i.e. a fresh worker process is created for each new task).

The main script can be simplified to:

import os
import subprocess as sp
import multiprocessing as mp

def do_work(task: dict) -> dict:
    res = {}
    # ... work ...
    for i in range(5):
        out = sp.run(cmd, stdout=sp.PIPE, stderr=sp.PIPE, check=False, timeout=60)
        res[i] = out.stdout.decode('utf-8')
    # ... some more work ...
    return res

if __name__ == '__main__':
    tasks = load_tasks_from_file(...)  # list of dicts
    logger = mp.get_logger()
    results = []
    with mp.Pool(processes=os.cpu_count(), maxtasksperchild=1) as pool:
        for i, res in enumerate(pool.imap_unordered(do_work, tasks), start=1):
            results.append(res)
            logger.info('PROGRESS: %3d/%3d', i, len(tasks))
    dump_results_to_file(results)

The pool sometimes gets stuck. The traceback I get after a KeyboardInterrupt is here. It indicates that the pool is no longer fetching new tasks and/or that worker processes are blocked in a queue / pipe recv() call. I was unable to reproduce this deterministically, even after varying my experiment configurations; there's a chance that running the same code again will finish gracefully.

Further observations:

  • Python 3.7.9 on x64 Linux
  • start method for multiprocessing is fork (using spawn does not solve the issue)
  • strace reveals that the processes are stuck in a futex wait; gdb's backtrace also shows: do_futex_wait.constprop
  • disabling logging / explicit flushing does not help
  • there's no bug in how a task is defined (i.e. they are all loadable).

Update: the deadlock occurs even with a pool of size 1.

strace reports that the process is blocked on trying to acquire some lock located at 0x564c5dbcd000:

futex(0x564c5dbcd000, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

and gdb confirms:

(gdb) bt
#0  0x00007fcb16f5d014 in do_futex_wait.constprop () from /usr/lib/libpthread.so.0
#1  0x00007fcb16f5d118 in __new_sem_wait_slow.constprop.0 () from /usr/lib/libpthread.so.0
#2  0x0000564c5cec4ad9 in PyThread_acquire_lock_timed (lock=0x564c5dbcd000, microseconds=-1, intr_flag=0)
    at /tmp/build/80754af9/python_1598874792229/work/Python/thread_pthread.h:372
#3  0x0000564c5ce4d9e2 in _enter_buffered_busy (self=self@entry=0x7fcafe1e7e90)
    at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/bufferedio.c:282
#4  0x0000564c5cf50a7e in _io_BufferedWriter_write_impl.isra.2 (self=0x7fcafe1e7e90)
    at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/bufferedio.c:1929
#5  _io_BufferedWriter_write (self=0x7fcafe1e7e90, arg=<optimized out>)
    at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/clinic/bufferedio.c.h:396

Answer

The deadlock was caused by high memory usage in the workers, which triggered the kernel's OOM killer. The OOM killer terminates worker processes abruptly (with SIGKILL), leaving the pool in an inconsistent state; the kills can be confirmed in the kernel log, e.g. via dmesg.

This script reproduces my original problem.
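To see why the pool hangs rather than erring out: a worker killed by the OOM killer simply vanishes without flushing its result queue, and Python 3.7's multiprocessing.Pool keeps waiting for the missing result. A minimal sketch (hypothetical, simulating the OOM kill with a self-inflicted SIGKILL) shows what such a death looks like from the parent:

```python
import multiprocessing as mp
import os
import signal

def victim():
    # Simulate what the OOM killer does: the process dies from SIGKILL,
    # with no chance to flush queues or report back to the pool.
    os.kill(os.getpid(), signal.SIGKILL)

if __name__ == '__main__':
    p = mp.Process(target=victim)
    p.start()
    p.join()
    # A negative exitcode means "terminated by that signal number".
    print(p.exitcode)  # -9 on Linux, i.e. -signal.SIGKILL
```

Pool's result-handler thread never inspects these exit codes while blocked on the result queue, which is why the parent just sits in a recv()/futex wait.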

For the time being I am considering switching to concurrent.futures.ProcessPoolExecutor, which raises a BrokenProcessPool exception when a worker terminates abruptly, instead of hanging.
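A minimal sketch of that failure mode, again simulating the OOM kill with a self-inflicted SIGKILL (the task function and its values are hypothetical stand-ins for the real workload):

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
import os
import signal

def do_work(task: int) -> int:
    # Stand-in for the real task; task 3 simulates a worker that the
    # OOM killer terminates with SIGKILL.
    if task == 3:
        os.kill(os.getpid(), signal.SIGKILL)
    return task * 2

def run_tasks() -> str:
    try:
        with ProcessPoolExecutor(max_workers=2) as pool:
            for _ in pool.map(do_work, range(5)):
                pass
        return 'finished'
    except BrokenProcessPool:
        return 'broken pool detected'

if __name__ == '__main__':
    # A plain mp.Pool would hang here; the executor raises instead.
    print(run_tasks())  # -> broken pool detected
```

Catching BrokenProcessPool lets the driver fail loudly (or retry the batch) rather than deadlocking silently.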

References:

  • https://bugs.python.org/issue22393#msg315684
  • https://stackoverflow.com/a/24896362