Numpy efficient big matrix multiplication

2024/10/7 6:37:47

To store big matrices on disk I use numpy.memmap.

Here is some sample code to test big matrix multiplication:

import numpy as np
import time

rows = 10000  # it can be large, for example 1kk
cols = 1000

# create some data in memory
data = np.arange(rows * cols, dtype='float32')
data.resize((rows, cols))

# create the files on disk
fp0 = np.memmap('C:/data_0', dtype='float32', mode='w+', shape=(rows, cols))
fp1 = np.memmap('C:/data_1', dtype='float32', mode='w+', shape=(rows, cols))
fp0[:] = data[:]
fp1[:] = data[:]

# matrix transpose test
tr = np.memmap('C:/data_tr', dtype='float32', mode='w+', shape=(cols, rows))
tr = np.transpose(fp1)  # memory consumption?
print(fp1.shape)
print(tr.shape)

res = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows, rows))
t0 = time.time()
# redefinition? res = np.dot(fp0, tr)  # takes 342 seconds on my machine; multiplying the matrices in RAM takes 345 seconds (I think it's a strange result)
res[:] = np.dot(fp0, tr)  # assignment?
print(res.shape)
print(time.time() - t0)

So my questions are:

  1. How can I restrict the memory consumption of an application that uses this procedure to some value, for example 100 MB (or 1 GB, or something else)? I also don't understand how to estimate the memory consumption of the procedure (I think memory is only allocated when the "data" variable is created, but how much memory is used when we work with memmap files?)
  2. Is there a more optimal solution for multiplying big matrices stored on disk? For example, maybe the data is not stored on or read from disk optimally, is not properly cached, and the dot product only uses one core. Maybe I should use something like PyTables?

I am also interested in algorithms for solving linear systems of equations (SVD and others) with restricted memory usage. These algorithms are sometimes called out-of-core or iterative, and I think there is an analogy like hard drive<->RAM, GPU RAM<->CPU RAM, CPU RAM<->CPU cache.

Also, here I found some info about matrix multiplication in PyTables.

Also, I found this in R, but I need it for Python or Matlab.

Answer

Dask.array provides a numpy interface to large on-disk arrays using blocked algorithms and task scheduling. It can easily do out-of-core matrix multiplies and other simple-ish numpy operations.
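As an illustration, here is a minimal sketch of how the multiplication from the question could be expressed with dask.array (this assumes dask is installed; the chunk sizes and file paths are illustrative, not tuned):

import numpy as np
import dask.array as da

rows, cols = 10000, 1000
fp0 = np.memmap('C:/data_0', dtype='float32', mode='r', shape=(rows, cols))
fp1 = np.memmap('C:/data_1', dtype='float32', mode='r', shape=(rows, cols))

# Wrap the on-disk arrays in dask arrays; work proceeds one block at a time.
a = da.from_array(fp0, chunks=(1000, 1000))
b = da.from_array(fp1, chunks=(1000, 1000))

# Compute a.dot(b.T) blockwise and stream the result into another memmap,
# so the full (rows x rows) product never has to sit in RAM at once.
res = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows, rows))
da.store(a.dot(b.T), res)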

Blocked linear algebra is harder and you might want to check out some of the academic work on this topic. Dask does support QR and SVD factorizations on tall-and-skinny matrices.
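For example, a rough sketch of the tall-and-skinny factorizations (assuming a reasonably recent dask; this algorithm expects the array to be chunked only along the rows, i.e. a single chunk across the columns):

import dask.array as da

# Tall-and-skinny matrix: many rows, few columns, chunked by rows only.
x = da.random.random((1000000, 500), chunks=(10000, 500))

q, r = da.linalg.qr(x)       # blocked QR factorization
u, s, v = da.linalg.svd(x)   # SVD built on top of the blocked QR

# The factors are lazy dask arrays; compute() triggers the actual work.
print(s.compute()[:5])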

Regardless, for large arrays you really want blocked algorithms, not naive traversals, which will hit disk in unpleasant ways.
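To make "blocked" concrete, here is a rough sketch (not the poster's code; the function name and block size are made up for illustration) of a blocked product C = A @ B.T over memmaps, holding only two row panels in RAM at a time:

import numpy as np

def blocked_matmul(a_path, b_path, c_path, rows, cols, block=1000):
    a = np.memmap(a_path, dtype='float32', mode='r', shape=(rows, cols))
    b = np.memmap(b_path, dtype='float32', mode='r', shape=(rows, cols))
    c = np.memmap(c_path, dtype='float32', mode='w+', shape=(rows, rows))
    for i in range(0, rows, block):
        ai = np.asarray(a[i:i + block])        # load one row panel of A into RAM
        for j in range(0, rows, block):
            bj = np.asarray(b[j:j + block])    # load one row panel of B into RAM
            c[i:i + block, j:j + block] = ai @ bj.T
    c.flush()

Peak memory here is roughly two panels of block x cols floats plus one block x block output tile, independent of the total matrix size.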
