Using h5py to create an HDF5 file with many datasets, I encounter a massive speed drop after roughly 2.88 million datasets. What is the reason for this?
I assume that the limit of the tree structure for the datasets is reached, so the tree has to be reordered, which is very time-consuming.
Here is a short example:
    import h5py
    import time

    hdf5_file = h5py.File("C://TEMP//test.hdf5")

    barrier = 1
    start = time.perf_counter()
    for i in range(int(1e8)):
        hdf5_file.create_dataset(str(i), [])
        td = time.perf_counter() - start
        if td > barrier:
            # report progress roughly once per second: elapsed seconds and dataset count
            print("{}: {}".format(int(td), i))
            barrier = int(td) + 1
        if td > 600:  # cancel after 600 s
            break
Edit:
By grouping the datasets, this limitation can be avoided:
    import h5py
    import time

    max_n_keys = int(1e7)
    max_n_group = int(1e5)

    hdf5_file = h5py.File("C://TEMP//test.hdf5", "w")
    group_key = str(max_n_group)
    hdf5_file.create_group(group_key)

    barrier = 1
    start = time.perf_counter()
    for i in range(max_n_keys):
        if i > max_n_group:
            # start a new group every 1e5 datasets
            max_n_group += int(1e5)
            group_key = str(max_n_group)
            hdf5_file.create_group(group_key)
        hdf5_file[group_key].create_dataset(str(i), data=[])
        td = time.perf_counter() - start
        if td > barrier:
            # report progress roughly once per second: elapsed seconds and dataset count
            print("{}: {}".format(int(td), i))
            barrier = int(td) + 1
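For reading the file back, it helps to be able to recompute which group a given dataset index landed in. A minimal sketch, assuming the grouping scheme of the code above (group str(k) holds indices k-1e5+1 .. k, with the first group "100000" also holding index 0); the helper name group_for_index is my own, not part of the original code:

    import h5py

    def group_for_index(i, chunk=int(1e5)):
        # first group ("100000") holds indices 0..chunk inclusive,
        # every later group str(k) holds indices k-chunk+1 .. k
        return str(max((i - 1) // chunk + 1, 1) * chunk)

    with h5py.File("C://TEMP//test.hdf5", "r") as f:
        ds = f[group_for_index(123456)][str(123456)]  # lands in group "200000"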