NumPy array larger than RAM: write to disk or out-of-core solution?

2024/11/15 23:54:13

I have the following workflow, whereby I append data to an empty pandas Series object. (This empty array could also be a NumPy array, or even a basic list.)

    in_memory_array = pd.Series([])
    for df in list_of_pandas_dataframes:
        new = df.apply(lambda row: compute_something(row), axis=1)  ## new is a pandas.Series
        in_memory_array = in_memory_array.append(new)

My problem is that the resulting array in_memory_array becomes too large for RAM. I don't need to keep all objects in memory for this computation.

One option, I think, is to pickle the object to disk once the array gets too big for RAM, e.g.

    # N = some size in bytes too large for RAM
    if sys.getsizeof(in_memory_array) > N:
        with open('mypickle.pickle', 'wb') as f:
            pickle.dump(in_memory_array, f)

Otherwise, is there an out-of-core solution? The best case scenario would be to create some cap such that the object cannot grow larger than X GB in RAM.

Answer

Check out the wendelin.core Python library (https://pypi.org/project/wendelin.core/). It lets you work with arrays bigger than RAM and local disk.
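If the result only needs to be larger than RAM (but not larger than local disk), plain NumPy may already be enough. The following is a minimal sketch, not the approach from the answer above: it swaps in a memory-mapped .npy file created with np.lib.format.open_memmap and writes each chunk of results straight to disk, so only one chunk of results is built in memory at a time. It assumes list_of_pandas_dataframes is a real list (so the total row count can be computed up front) and that the results are numeric. The function name apply_to_disk, the file name results.npy, the sample frames, and the body of compute_something are hypothetical placeholders.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def compute_something(row):
    # Hypothetical stand-in for the per-row function from the question.
    return row["a"] + row["b"]


def apply_to_disk(frames, path, dtype=np.float64):
    """Write the per-row results of every DataFrame into one on-disk .npy array."""
    # apply(..., axis=1) yields one result per row, so the final length is known.
    total = sum(len(df) for df in frames)
    out = np.lib.format.open_memmap(path, mode="w+", dtype=dtype, shape=(total,))
    start = 0
    for df in frames:
        new = df.apply(compute_something, axis=1)  # one chunk of results
        out[start:start + len(new)] = new.to_numpy()
        start += len(new)
    out.flush()  # push pending writes to the file
    del out      # drop the writable mapping
    return total


# Tiny illustrative input; real frames would be much larger.
frames = [
    pd.DataFrame({"a": [1.0, 2.0], "b": [10.0, 20.0]}),
    pd.DataFrame({"a": [3.0], "b": [30.0]}),
]
path = os.path.join(tempfile.mkdtemp(), "results.npy")
n_written = apply_to_disk(frames, path)

# Reopen lazily: data is paged in from disk only when it is sliced.
result = np.load(path, mmap_mode="r")
```

This does not enforce a hard cap of X GB. The operating system decides how much of the mapped file stays cached in RAM, and the working set per iteration is roughly one DataFrame plus its results. If the total length is not known in advance, writing one file per chunk is a simpler alternative.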

