Why such a big pickle of a sklearn decision tree (30K times bigger)?

2024/10/14 22:22:25

Why can pickling a sklearn decision tree generate a pickle thousands of times bigger (in terms of memory) than the original estimator?

I ran into this issue at work where a random forest estimator (with 100 decision trees) over a dataset with around 1_000_000 samples and 7 features generated a pickle bigger than 2GB.

I was able to track the issue down to the pickling of a single decision tree, and I was able to replicate it with a generated dataset, as shown below.

For memory estimations I used the pympler library. The sklearn version used is 1.0.1.

# here using a regressor tree but I would expect the same issue to be present with a classification tree
import pickle
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1  # using a dataset generation function from sklearn
from pympler import asizeof


# function that creates the dataset and trains the estimator
def make_example(n_samples: int):
    X, y = make_friedman1(n_samples=n_samples, n_features=7, noise=1.0, random_state=49)
    estimator = DecisionTreeRegressor(max_depth=50, max_features='auto', min_samples_split=5)
    estimator.fit(X, y)
    return X, y, estimator


# utilities to compute and compare the size of an object and its pickled version
def readable_size(size_in_bytes: int, suffix='B') -> str:
    num = size_in_bytes
    for unit in ['', 'k', 'M', 'G', 'T', 'P', 'E', 'Z']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)


def print_size(obj, skip_detail=False):
    obj_size = asizeof.asized(obj).size
    print(readable_size(obj_size))
    return obj_size


def compare_with_pickle(obj):
    size_obj = print_size(obj)
    size_pickle = print_size(pickle.dumps(obj))
    print(f"Ratio pickle/obj: {(size_pickle / size_obj):.2f}")


_, _, model100K = make_example(100_000)
compare_with_pickle(model100K)
_, _, model1M = make_example(1_000_000)
compare_with_pickle(model1M)

output:

1.7 kB
4.9 MB
Ratio pickle/obj: 2876.22
1.7 kB
49.3 MB
Ratio pickle/obj: 28982.84

Answer

Preamble

asizeof tends to report incorrect sizes when it does not know how to resolve the references held by an object. By default, asizeof only traverses attributes when computing a size. There are exceptions, however: the reference methods for some libraries, such as numpy, are hardcoded.

I suspect DecisionTreeRegressor uses its own internal reference methods to build its tree/graph, and that asizeof does not recognize them. If so, the 1.7 kB figure underestimates the estimator, and the pickle is closer to its true size.
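One way to check this suspicion is to add up the arrays the fitted tree actually stores and compare the total with the pickle size. The following is a minimal sketch, not a definitive measurement: it assumes numpy and sklearn are installed, uses a small random dataset as a stand-in for the one in the question, and relies on `tree_.__getstate__()` exposing the node and value arrays, which is an implementation detail that may change between sklearn versions.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Small random dataset, used only for illustration (not the question's data).
rng = np.random.default_rng(0)
X = rng.random((2000, 7))
y = rng.random(2000)

est = DecisionTreeRegressor(min_samples_split=5, random_state=0).fit(X, y)

# The fitted tree lives in a compiled extension object (est.tree_).
# Its pickle state exposes the underlying numpy arrays.
state = est.tree_.__getstate__()
array_bytes = sum(v.nbytes for v in state.values() if isinstance(v, np.ndarray))
pickle_bytes = len(pickle.dumps(est))

print(f"nodes in tree:       {est.tree_.node_count}")
print(f"bytes in tree arrays: {array_bytes}")
print(f"bytes in pickle:      {pickle_bytes}")
```

If the array total is of the same order as the pickle size, the pickle is not inflated; the in-memory estimate was simply missing the data held by `tree_`.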

Reducing output size

Depending on your requirements (Python version, compatibility, time), you may be able to reduce the output size by changing pickle's default protocol parameter to a more space-efficient protocol.
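As a rough illustration of the protocol parameter, the sketch below pickles the same object with every available protocol and prints the resulting sizes. The list of floats is a hypothetical stand-in for model data. On recent Python versions the default is already a binary protocol, so the gain from changing it may be small.

```python
import pickle

# Hypothetical stand-in for model data: a list of floats.
data = [i / 7 for i in range(10_000)]

# Pickle the same object once per protocol and record the byte length.
sizes = {
    proto: len(pickle.dumps(data, protocol=proto))
    for proto in range(pickle.HIGHEST_PROTOCOL + 1)
}

for proto, size in sizes.items():
    print(f"protocol {proto}: {size} bytes")

print(f"default protocol on this interpreter: {pickle.DEFAULT_PROTOCOL}")
```

Higher protocols are not readable by older Python versions, which is the compatibility trade-off mentioned above.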

There is also a built-in module called pickletools that can be used to reduce the space used by your pickled file (pickletools.optimize). pickletools can also be used to disassemble the pickle byte code.
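A minimal sketch of both pickletools uses, with a small hypothetical dictionary standing in for a real model. `pickletools.optimize` drops opcodes the pickle does not need, so the saving is usually modest compared with the size of large arrays.

```python
import pickle
import pickletools

# Hypothetical small object; a real estimator would be passed the same way.
data = {"feature_names": ["a", "b", "c"], "thresholds": [0.1, 0.5, 0.9]}

raw = pickle.dumps(data)
optimized = pickletools.optimize(raw)  # removes unused memo opcodes

print(f"raw:       {len(raw)} bytes")
print(f"optimized: {len(optimized)} bytes")

# The optimized pickle still loads to an equal object.
restored = pickle.loads(optimized)

# Disassemble the byte code to see what the pickle contains.
pickletools.dis(optimized)
```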

Furthermore, you may compress the pickled output using the built-in compression and archiving modules.
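For example, the pickled bytes can be passed through gzip or lzma from the standard library. This is a sketch under the assumption that the data is fairly repetitive, as in the hypothetical list below; how well a real tree compresses depends on its contents.

```python
import gzip
import lzma
import pickle

# Hypothetical, deliberately repetitive data so the effect is visible.
data = [float(i % 100) for i in range(50_000)]

raw = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
gz = gzip.compress(raw)
xz = lzma.compress(raw)

print(f"raw pickle: {len(raw)} bytes")
print(f"gzip:       {len(gz)} bytes")
print(f"lzma:       {len(xz)} bytes")

# Decompress before unpickling to get the object back.
restored = pickle.loads(gzip.decompress(gz))
```

Compression trades CPU time on save and load for a smaller file, so it suits models that are stored or transferred more often than they are loaded.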

References

https://github.com/pympler/pympler/blob/master/pympler/asizeof.py

https://docs.python.org/3/library/pickle.html

https://docs.python.org/3/library/pickletools.html#module-pickletools

https://docs.python.org/3/library/archiving.html
