Numpy array larger than RAM: write to disk or out-of-core solution?

2024/11/15 23:54:13

I have the following workflow, whereby I append data to an empty pandas Series object. (This empty array could also be a NumPy array, or even a basic list.)

    in_memory_array = pd.Series([])
    for df in list_of_pandas_dataframes:
        new = df.apply(lambda row: compute_something(row), axis=1)  ## new is a pandas.Series
        in_memory_array = in_memory_array.append(new)

My problem is that the resulting array in_memory_array becomes too large for RAM. I don't need to keep all objects in memory for this computation.

I think one option is to pickle the array to disk once it gets too big for RAM, e.g.

# N = some size in bytes too large for RAM
    if sys.getsizeof(in_memory_array) > N:
        with open('mypickle.pickle', 'wb') as f:
            pickle.dump(in_memory_array, f)

Otherwise, is there an out-of-core solution? The best case scenario would be to create some cap such that the object cannot grow larger than X GB in RAM.

Answer

Check out the wendelin.core Python library (https://pypi.org/project/wendelin.core/). It allows you to work with arrays bigger than RAM and local disk.
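If the results are plain numbers, a simpler route that needs only NumPy may also work. The sketch below is not wendelin.core. It writes each chunk straight to a raw binary file instead of growing an in-memory array, then opens the file with np.memmap so that only the pages you touch are loaded into RAM. The helper append_chunks_to_disk, the generator chunk_results, and the file name results.bin are made up for illustration, and the float64 dtype is an assumption. In the question's workflow each chunk would come from something like df.apply(compute_something, axis=1).to_numpy().

```python
import os
import tempfile

import numpy as np


def append_chunks_to_disk(chunks, path, dtype=np.float64):
    """Write each chunk to `path` as raw binary and return the element count.

    Only one chunk is held in memory at a time.
    """
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            arr = np.asarray(chunk, dtype=dtype)
            arr.tofile(f)  # appends the raw bytes at the current file position
            total += arr.size
    return total


def chunk_results():
    # Hypothetical stand-in for the per-DataFrame results in the question.
    for start in range(0, 10, 4):
        yield np.arange(start, min(start + 4, 10)) * 2.0


path = os.path.join(tempfile.mkdtemp(), "results.bin")
n = append_chunks_to_disk(chunk_results(), path)

# Open the file as a read-only array backed by disk rather than RAM.
result = np.memmap(path, dtype=np.float64, mode="r", shape=(n,))
```

The raw file stores no dtype or shape, so both have to be supplied again when opening it. This does not give a hard cap of X GB, but peak memory stays at roughly the size of one chunk.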
