Question 1

I am trying to speed up reading chunks of an h5py dataset file into RAM. Right now I try to do this with the multiprocessing library:

import multiprocessing as mp

pool = mp.Pool(NUM_PROCESSES)
gen = pool.imap(loader, indices)

Where the loader function is something like this:

def loader(indices):
    with h5py.File("location", 'r') as dataset:
        x = dataset["name"][indices]
    return x

This sometimes works, meaning that the loading time is divided by the number of processes and the read is effectively parallelized. Most of the time, however, it doesn't, and the loading time stays as high as when loading the data sequentially. Is there anything I can do to fix this? I know h5py supports parallel reads and writes through mpi4py, but I would like to know whether that is strictly necessary for read-only access as well.

Answer 1

Parallel reads are fine with h5py; there is no need for the MPI version. But why do you expect a speed-up here? Your job is almost entirely I/O bound, not CPU bound. Parallel processes are not going to help, because the bottleneck is your hard disk, not the CPU. It wouldn't surprise me if parallelization in this case even slowed down the whole read operation. Other opinions?
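One rough way to check the I/O-bound claim on your own machine is to compare wall-clock time with CPU time for a single sequential read. The sketch below assumes the read is wrapped in a zero-argument callable; `wall_and_cpu_seconds` and `cpu_fraction` are hypothetical helper names. A low CPU fraction suggests the process is mostly waiting on the disk, in which case extra processes are unlikely to help.

```python
import time


def wall_and_cpu_seconds(read_fn):
    """Run read_fn once; return (wall-clock seconds, CPU seconds of this process)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    read_fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def cpu_fraction(wall, cpu):
    """Share of elapsed time spent on the CPU; a low value hints at waiting on I/O."""
    return cpu / wall if wall > 0 else 0.0
```

For example, pass a lambda that performs one read, such as lambda: loader(indices), and feed the two returned values to cpu_fraction. Repeated runs may differ if the operating system has already cached parts of the file.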

Is it possible to do parallel reads on one h5py file using multiprocessing?
