Occasional deadlock in multiprocessing.Pool

2024/10/12 11:14:36

I have N independent tasks that are executed in a multiprocessing.Pool of size os.cpu_count() (8 in my case), with maxtasksperchild=1 (i.e. a fresh worker process is created for each new task).

The main script can be simplified to:

import os
import subprocess as sp
import multiprocessing as mp

def do_work(task: dict) -> dict:
    res = {}
    # ... work ...
    for i in range(5):
        out = sp.run(cmd, stdout=sp.PIPE, stderr=sp.PIPE, check=False, timeout=60)
        res[i] = out.stdout.decode('utf-8')
    # ... some more work ...
    return res

if __name__ == '__main__':
    tasks = load_tasks_from_file(...)  # list of dicts
    logger = mp.get_logger()
    results = []
    with mp.Pool(processes=os.cpu_count(), maxtasksperchild=1) as pool:
        for i, res in enumerate(pool.imap_unordered(do_work, tasks), start=1):
            results.append(res)
            logger.info('PROGRESS: %3d/%3d', i, len(tasks))
    dump_results_to_file(results)

The pool sometimes gets stuck. The traceback I get after a KeyboardInterrupt is here. It indicates that the pool is no longer fetching new tasks and/or that worker processes are blocked in a queue / pipe recv() call. I was unable to reproduce this deterministically, even after varying my experiment configurations; there's a chance that running the same code again will finish gracefully.

Further observations:

  • Python 3.7.9 on x64 Linux
  • start method for multiprocessing is fork (using spawn does not solve the issue)
  • strace reveals that the processes are stuck in a futex wait; gdb's backtrace also shows: do_futex_wait.constprop
  • disabling logging / explicit flushing does not help
  • there's no bug in how a task is defined (i.e. they are all loadable).

Update: the deadlock occurs even with a pool of size 1.

strace reports that the process is blocked on trying to acquire some lock located at 0x564c5dbcd000:

futex(0x564c5dbcd000, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

and gdb confirms:

(gdb) bt
#0  0x00007fcb16f5d014 in do_futex_wait.constprop () from /usr/lib/libpthread.so.0
#1  0x00007fcb16f5d118 in __new_sem_wait_slow.constprop.0 () from /usr/lib/libpthread.so.0
#2  0x0000564c5cec4ad9 in PyThread_acquire_lock_timed (lock=0x564c5dbcd000, microseconds=-1, intr_flag=0)
    at /tmp/build/80754af9/python_1598874792229/work/Python/thread_pthread.h:372
#3  0x0000564c5ce4d9e2 in _enter_buffered_busy (self=self@entry=0x7fcafe1e7e90)
    at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/bufferedio.c:282
#4  0x0000564c5cf50a7e in _io_BufferedWriter_write_impl.isra.2 (self=0x7fcafe1e7e90)
    at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/bufferedio.c:1929
#5  _io_BufferedWriter_write (self=0x7fcafe1e7e90, arg=<optimized out>)
    at /tmp/build/80754af9/python_1598874792229/work/Modules/_io/clinic/bufferedio.c.h:396

Answer

The deadlock was caused by high memory usage in the workers, which triggered the kernel's OOM killer. The OOM killer terminates worker processes abruptly (with SIGKILL), leaving the pool in an inconsistent state; the kills can be confirmed in the kernel log, e.g. via dmesg.

This script reproduces my original problem.
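To see why the pool hangs rather than erring out: a worker killed by the OOM killer simply vanishes without flushing its result queue, and Python 3.7's multiprocessing.Pool keeps waiting for the missing result. A minimal sketch (hypothetical, simulating the OOM kill with a self-inflicted SIGKILL) shows what such a death looks like from the parent:

```python
import multiprocessing as mp
import os
import signal

def victim():
    # Simulate what the OOM killer does: the process dies from SIGKILL,
    # with no chance to flush queues or report back to the pool.
    os.kill(os.getpid(), signal.SIGKILL)

if __name__ == '__main__':
    p = mp.Process(target=victim)
    p.start()
    p.join()
    # A negative exitcode means "terminated by that signal number".
    print(p.exitcode)  # -9 on Linux, i.e. -signal.SIGKILL
```

Pool's result-handler thread never inspects these exit codes while blocked on the result queue, which is why the parent just sits in a recv()/futex wait.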

For the time being I am considering switching to concurrent.futures.ProcessPoolExecutor, which raises a BrokenProcessPool exception when a worker terminates abruptly, instead of hanging.
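A minimal sketch of that failure mode, again simulating the OOM kill with a self-inflicted SIGKILL (the task function and its values are hypothetical stand-ins for the real workload):

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
import os
import signal

def do_work(task: int) -> int:
    # Stand-in for the real task; task 3 simulates a worker that the
    # OOM killer terminates with SIGKILL.
    if task == 3:
        os.kill(os.getpid(), signal.SIGKILL)
    return task * 2

def run_tasks() -> str:
    try:
        with ProcessPoolExecutor(max_workers=2) as pool:
            for _ in pool.map(do_work, range(5)):
                pass
        return 'finished'
    except BrokenProcessPool:
        return 'broken pool detected'

if __name__ == '__main__':
    # A plain mp.Pool would hang here; the executor raises instead.
    print(run_tasks())  # -> broken pool detected
```

Catching BrokenProcessPool lets the driver fail loudly (or retry the batch) rather than deadlocking silently.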

References:

  • https://bugs.python.org/issue22393#msg315684
  • https://stackoverflow.com/a/24896362