Here's what I am trying to accomplish -
- I have about a million files which I need to parse & append the parsed content to a single file.
- Doing this in a single process takes ages, so that option is out.
- Threads in Python are out too, since the GIL means they effectively run Python code one at a time anyway.
- Hence I'm using the multiprocessing module, i.e. spawning 4 sub-processes to utilize all that raw core power :)
So far so good. Now I need a shared object which all the sub-processes can access, so I am using a Queue from the multiprocessing module. All the sub-processes also need to write their output to a single file, which is a potential place to use Locks I guess (I've sketched what I mean right after the code below). With this setup, when I run it I do not get any error (so the parent process seems fine), it just stalls. When I press Ctrl-C I see a traceback (one for each sub-process), and no output is written to the output file. Here's the code (note that everything runs fine in single-process mode) -
import os
import glob
from multiprocessing import Process, Queue, Pool

data_file = open('out.txt', 'w+')

def worker(task_queue):
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))
        if data:
            data_file.write(repr(data)+'\n')
    return

def main():
    task_queue = Queue()
    for file in glob.glob('*.csv'):
        task_queue.put(file)
    task_queue.put('STOP') # so that worker processes know when to stop

    # this is the block of code that needs correction.
    if multi_process:
        # One way to spawn 4 processes
        # pool = Pool(processes=4) # Start worker processes
        # res = pool.apply_async(worker, [task_queue, data_file])
        # But I chose to do it like this for now.
        for i in range(4):
            proc = Process(target=worker, args=[task_queue])
            proc.start()
    else: # single process mode is working fine!
        worker(task_queue)
    data_file.close()
    return
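For reference, this is roughly what I meant by the Lock idea above. It's just a hypothetical sketch I have not actually wired in yet: each worker opens the output file itself in append mode and a shared Lock serializes the writes (mine_imdb_page and DATA_DIR are from my real script) -

import os
from multiprocessing import Process, Queue, Lock

def worker(task_queue, lock):
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))
        if data:
            with lock: # only one process writes at a time
                with open('out.txt', 'a') as out:
                    out.write(repr(data)+'\n')

# and the spawning part would become something like:
# lock = Lock()
# for i in range(4):
#     Process(target=worker, args=[task_queue, lock]).start()

No idea if this is the right approach though.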
What am I doing wrong? I also tried passing the open file object to each of the processes at spawn time, e.g. Process(target=worker, args=[task_queue, data_file]) (the full variant is shown below), but that did not change anything. I feel the sub-processes are not able to write to the file for some reason: either the file object is not getting replicated at spawn time, or there is some other quirk... Anybody got an idea?
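For completeness, the variant where I passed the file object in looked like this (same worker, just with data_file as an extra argument) -

def worker(task_queue, data_file):
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))
        if data:
            data_file.write(repr(data)+'\n')
    return

# spawned with:
# proc = Process(target=worker, args=[task_queue, data_file])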
EXTRA: Is there any way to keep a persistent MySQL connection open and pass it across to the sub-processes? i.e. I open a MySQL connection in my parent process, and that open connection should be accessible to all my sub-processes. Basically this would be the equivalent of shared memory in Python. Any ideas here?
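By "shared memory" I mean something along the lines of multiprocessing's Value/Array, which works fine for plain numbers, e.g. -

from multiprocessing import Process, Value

def increment(counter):
    with counter.get_lock(): # Value carries its own lock for safe updates
        counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0) # a shared integer living in shared memory
    procs = [Process(target=increment, args=[counter]) for i in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(counter.value) # 4

But I don't see how something like an open MySQL connection could go into that, hence the question.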