To store a big matrix on disk I use numpy.memmap.
Here is some sample code to test big matrix multiplication:
import numpy as np
import time

rows = 10000  # it can be large, for example 1kk
cols = 1000

# create some data in memory
data = np.arange(rows * cols, dtype='float32')
data.resize((rows, cols))

# create files on disk
fp0 = np.memmap('C:/data_0', dtype='float32', mode='w+', shape=(rows, cols))
fp1 = np.memmap('C:/data_1', dtype='float32', mode='w+', shape=(rows, cols))
fp0[:] = data[:]
fp1[:] = data[:]

# matrix transpose test
tr = np.memmap('C:/data_tr', dtype='float32', mode='w+', shape=(cols, rows))
tr = np.transpose(fp1)  # memory consumption?
print(fp1.shape)
print(tr.shape)

res = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows, rows))
t0 = time.time()
# redefinition? res = np.dot(fp0, tr) takes 342 seconds on my machine; if I
# multiply the matrices in RAM it takes 345 seconds (I think that's a strange result)
res[:] = np.dot(fp0, tr)  # assignment?
print(res.shape)
print(time.time() - t0)
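One thing worth noting about the code above: `tr = np.transpose(fp1)` rebinds the name `tr` to a view of `fp1` and silently discards the `data_tr` memmap, so nothing is ever written to that file. To keep the transpose on disk, write through a slice instead. A minimal sketch (the small sizes and temporary file paths are just for illustration):

```python
import os
import tempfile
import numpy as np

rows, cols = 100, 50  # small sizes so the sketch runs quickly
tmpdir = tempfile.mkdtemp()

src = np.memmap(os.path.join(tmpdir, 'src'), dtype='float32',
                mode='w+', shape=(rows, cols))
src[:] = np.arange(rows * cols, dtype='float32').reshape(rows, cols)

tr = np.memmap(os.path.join(tmpdir, 'tr'), dtype='float32',
               mode='w+', shape=(cols, rows))
# `tr = np.transpose(src)` would rebind the name and discard the memmap;
# slice assignment writes the transposed data into the file on disk.
tr[:] = src.T
tr.flush()

print(np.allclose(tr, src.T))
```

With the slice assignment, the transpose actually lives in the `tr` file rather than as an in-RAM view of `src`.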
So my questions are:
- How can I restrict the memory consumption of an application that uses this procedure to some value, for example 100 MB (or 1 GB, or something else)? Also, I don't understand how to estimate the memory consumption of the procedure (I think memory is only allocated when the "data" variable is created, but how much memory is used when we work with memmap files?)
- Maybe there is some optimal solution for multiplying big matrices stored on disk? For example, maybe the data is not optimally stored on disk or read from disk, or is not properly cached, and the dot product uses only one core. Maybe I should use something like PyTables?
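On the multiplication question, one common out-of-core pattern is to compute the product in row blocks, so only a bounded slice of the left operand and of the result is pulled into RAM at a time. A sketch (small illustrative sizes; the `block` value is a placeholder you would tune to your memory budget, and note that `b` is still read in full here, so a fully bounded version would also block over its columns):

```python
import os
import tempfile
import numpy as np

rows, cols = 200, 80  # small sizes for the sketch
block = 50            # tune so block * cols * 4 bytes fits your budget

tmpdir = tempfile.mkdtemp()
a = np.memmap(os.path.join(tmpdir, 'a'), dtype='float32',
              mode='w+', shape=(rows, cols))
b = np.memmap(os.path.join(tmpdir, 'b'), dtype='float32',
              mode='w+', shape=(cols, rows))
rng = np.random.default_rng(0)
a[:] = rng.random((rows, cols), dtype='float32')
b[:] = rng.random((cols, rows), dtype='float32')

out = np.memmap(os.path.join(tmpdir, 'out'), dtype='float32',
                mode='w+', shape=(rows, rows))
for i in range(0, rows, block):
    # each iteration touches only `block` rows of `a` and of `out`;
    # peak RAM is roughly (block*cols + cols*rows + block*rows) floats
    out[i:i + block] = np.dot(a[i:i + block], b)
out.flush()

print(np.allclose(out, np.dot(a, b)))
```

Since the OS pages memmap data in and out lazily, the blocks that are no longer referenced can be evicted, which is what keeps the resident set bounded.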
I am also interested in algorithms for solving linear systems of equations (SVD and others) with restricted memory usage. Maybe these algorithms are called out-of-core or iterative, and I think there is an analogy like hard drive <-> RAM, GPU RAM <-> CPU RAM, CPU RAM <-> CPU cache.
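Such iterative methods do exist: conjugate gradient, for instance, only touches the matrix through matrix-vector products, so each product can stream a disk-backed matrix in row blocks while keeping just a few vectors in RAM. A rough sketch, using an in-RAM array as a stand-in for a memmap (the sizes, block size, and test matrix are arbitrary assumptions):

```python
import numpy as np

n, block = 120, 40
rng = np.random.default_rng(1)
m = rng.random((n, n))
A = m @ m.T + n * np.eye(n)  # symmetric positive definite test matrix

def matvec(v):
    # stream the matrix in row blocks; with a np.memmap in place of A,
    # only `block` rows need to be resident at a time
    out = np.empty(n)
    for i in range(0, n, block):
        out[i:i + block] = A[i:i + block] @ v
    return out

def cg(bvec, tol=1e-10, maxiter=500):
    # textbook conjugate gradient: the matrix appears only via matvec()
    x = np.zeros(n)
    r = bvec - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

b = rng.random(n)
x = cg(b)
print(np.allclose(A @ x, b))
```

The same matvec-streaming idea underlies out-of-core variants of other iterative solvers and randomized SVD methods, which is why they fit the hard drive <-> RAM analogy above.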
Also, here I found some info about matrix multiplication in PyTables.
I also found this in R, but I need it for Python or Matlab.