I have the following workflow, whereby I append data to an empty pandas Series object. (This empty array could also be a NumPy array, or even a basic list.)
    in_memory_array = pd.Series([])

    for df in list_of_pandas_dataframes:
        new = df.apply(lambda row: compute_something(row), axis=1)  # new is a pandas.Series
        in_memory_array = in_memory_array.append(new)
My problem is that the resulting array in_memory_array
becomes too large for RAM. I don't need to keep all objects in memory for this computation.
One option I can think of is pickling the object to disk once the array gets too large for RAM, e.g.
    # N = some size in bytes too large for RAM
    if sys.getsizeof(in_memory_array) > N:
        with open('mypickle.pickle', 'wb') as f:
            pickle.dump(in_memory_array, f)
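Expanding on that idea, here is a minimal sketch of what I mean: spill the partial results to disk whenever the in-memory Series exceeds a cap, then start a fresh one. The 2 GB cap, the part-file names, and the use of pd.concat (in place of the now-deprecated Series.append) are all just my own assumptions:

    import pickle
    import sys

    import pandas as pd

    N = 2 * 1024 ** 3  # assumed cap: 2 GB, adjust as needed
    results = pd.Series([], dtype="float64")
    spill_count = 0

    for df in list_of_pandas_dataframes:
        new = df.apply(lambda row: compute_something(row), axis=1)
        results = pd.concat([results, new])
        # once the in-memory Series exceeds the cap, spill it to disk and reset
        if sys.getsizeof(results) > N:
            with open(f"results_part_{spill_count}.pickle", "wb") as f:
                pickle.dump(results, f)
            spill_count += 1
            results = pd.Series([], dtype="float64")

    # spill whatever is left over at the end
    if len(results):
        with open(f"results_part_{spill_count}.pickle", "wb") as f:
            pickle.dump(results, f)

This works, but it leaves me with a pile of pickle files that I have to manage and reload myself, which feels clumsy.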
Otherwise, is there an out-of-core solution? Ideally I would like to set a cap so that the object can never grow larger than X GB in RAM.
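For reference, here is a rough sketch of what I imagine an out-of-core version might look like, using pandas' HDFStore (which requires PyTables); the file name "results.h5" and the key "results" are just placeholders I made up:

    import pandas as pd

    # append each chunk of results to an HDF5 file on disk, so only the
    # current chunk ever lives in RAM
    with pd.HDFStore("results.h5") as store:
        for df in list_of_pandas_dataframes:
            new = df.apply(lambda row: compute_something(row), axis=1)
            store.append("results", new)  # written to disk, not accumulated in memory

    # later, the results can be read back selectively, e.g. in row ranges:
    # pd.read_hdf("results.h5", "results", start=0, stop=1_000_000)

Is something along these lines the recommended approach, or is there a better tool (dask, etc.) for capping memory usage in this kind of workflow?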