Concatenate Numpy arrays with least memory

2024/9/8 10:25:32

I have a 50 GB dataset saved in an HDF5 file (via h5py), which behaves like a dictionary. The dictionary has keys from 0 to n, and the values are 3-dimensional NumPy ndarrays that all have the same shape. For example:

dictionary[0] = np.array([[[...],[...]]...])

I want to concatenate all of these arrays, with code like

sample = np.concatenate(list(dictionary.values()))

This operation uses 100 GB of memory! If I then run

del dictionary

the memory usage drops to 50 GB. But I want to keep the memory usage at 50 GB while loading the data. Another approach I tried looks like this

    sample = np.concatenate((sample, dictionary[key]))

It still uses 100 GB of memory. I think that in all the cases above, the right-hand side allocates a new memory block for the result, which is then assigned to the left-hand side, so memory usage doubles during the computation. So the third approach I tried looks like this

sample = np.empty(shape)
with h5py.File(...) as dictionary:
    for key in dictionary.keys():
        sample[key] = dictionary[key]

I thought this code had an advantage: once the value dictionary[key] has been assigned to a row of sample, the memory held by dictionary[key] should be freed. However, I tested it and found that the memory usage is still 100 GB. Why?

Is there a good way to limit the memory usage to 50 GB?


Your problem is that two copies of the same data are in memory at once: the source dictionary and the concatenated result. If you build the array as in test1, you'll need far less memory at once, but at the cost of losing the dictionary.

import numpy as np
import time

def test1(n):
    a = {x: (x, x, x) for x in range(n)}  # Build sample data
    b = np.array([a.pop(i) for i in range(n)]).reshape(-1)
    return b

def test2(n):
    a = {x: (x, x, x) for x in range(n)}  # Build sample data
    b = np.concatenate(list(a.values()))
    return b

x1 = test1(1000000)
del x1
time.sleep(1)
x2 = test2(1000000)



test1 : 0.71 s
test2 : 1.39 s

The first peak in the memory profile is for test1. It's not exactly in place, but it reduces the memory usage quite a bit.
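
For the 3-D arrays in the question, the same pop-as-you-go idea can be combined with a preallocated output array. Below is a minimal sketch, not something tested on 50 GB of data. The helper name concat_by_popping is made up for illustration, and it assumes the keys are exactly 0 to n-1, that every block has the same shape and dtype, and that you want to concatenate along the first axis. Whether the peak really stays near one copy plus one block depends on the OS committing the np.empty pages lazily and on the allocator handing freed blocks back, so measure it on your own machine.

```python
import numpy as np

def concat_by_popping(arrays):
    """Concatenate same-shaped arrays along axis 0, emptying `arrays` as it goes.

    Hypothetical helper: assumes keys are 0..n-1 and all values share
    shape and dtype.
    """
    n = len(arrays)
    first = arrays[0]
    rows = first.shape[0]
    # Allocate the full result once; nothing is copied yet.
    out = np.empty((n * rows,) + first.shape[1:], dtype=first.dtype)
    del first  # drop the extra reference so block 0 can be freed later

    for i in range(n):
        block = arrays.pop(i)                  # remove the block from the dict
        out[i * rows:(i + 1) * rows] = block   # copy it into its slot
        del block                              # last reference gone, block can be freed
    return out

# Small demonstration with five blocks of shape (2, 3, 4).
data = {i: np.full((2, 3, 4), i, dtype=np.float32) for i in range(5)}
sample = concat_by_popping(data)
print(sample.shape)  # (10, 3, 4)
print(len(data))     # 0, the dictionary has been consumed
```

As with test1, the price is that the dictionary is empty afterwards. If the blocks come straight from an HDF5 file, the same loop can read each dataset into its slot of the preallocated array instead of popping from an in-memory dict.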

