Question 1

Why can pickling a sklearn decision tree generate a pickle thousands of times bigger (in terms of memory) than the original estimator?

I ran into this issue at work where a random forest estimator (with 100 decision trees) over a dataset with around 1_000_000 samples and 7 features generated a pickle bigger than 2GB.

I was able to track down the issue to the pickling of a single decision tree and I was able to replicate the issue with a generated dataset as below.

For memory estimations I used the pympler library. The sklearn version used is 1.0.1.

# here using a regressor tree but I would expect the same issue to be present with a classification tree
import pickle
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1  # using a dataset generation function from sklearn
from pympler import asizeof


# function that creates the dataset and trains the estimator
def make_example(n_samples: int):
    X, y = make_friedman1(n_samples=n_samples, n_features=7, noise=1.0, random_state=49)
    estimator = DecisionTreeRegressor(max_depth=50, max_features='auto', min_samples_split=5)
    estimator.fit(X, y)
    return X, y, estimator


# utilities to compute and compare the size of an object and its pickled version
def readable_size(size_in_bytes: int, suffix='B') -> str:
    num = size_in_bytes
    for unit in ['', 'k', 'M', 'G', 'T', 'P', 'E', 'Z']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)


def print_size(obj, skip_detail=False):
    obj_size = asizeof.asized(obj).size
    print(readable_size(obj_size))
    return obj_size


def compare_with_pickle(obj):
    size_obj = print_size(obj)
    size_pickle = print_size(pickle.dumps(obj))
    print(f"Ratio pickle/obj: {(size_pickle / size_obj):.2f}")


_, _, model100K = make_example(100_000)
compare_with_pickle(model100K)
_, _, model1M = make_example(1_000_000)
compare_with_pickle(model1M)

output:

1.7 kB
4.9 MB
Ratio pickle/obj: 2876.22
1.7 kB
49.3 MB
Ratio pickle/obj: 28982.84

Answer

Preamble

asizeof tends to report inaccurate sizes when it does not know how to resolve the references an object holds. By default, asizeof only traverses attributes for its calculations. There are exceptions, however: reference methods for libraries such as numpy are hardcoded.

I suspect DecisionTreeRegressor has its own internal reference methods, used to build the tree/graph, that asizeof does not recognize.
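One way to check this suspicion is to measure the fitted tree directly instead of relying on asizeof. The sketch below assumes that most of the model's data lives in the NumPy arrays exposed by `estimator.tree_` (`children_left`, `threshold`, `value`, and so on), and compares their combined size with the size of the pickle. The small sample count and the `random_state=0` argument are choices made here for illustration, not taken from the question.

```python
import pickle

from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

# Small synthetic dataset so the example runs quickly (assumed size, not the 1M rows above).
X, y = make_friedman1(n_samples=2_000, n_features=7, noise=1.0, random_state=49)
estimator = DecisionTreeRegressor(max_depth=50, min_samples_split=5, random_state=0)
estimator.fit(X, y)

tree = estimator.tree_  # low-level tree object holding the per-node arrays

# Per-node arrays exposed as attributes of the fitted tree.
node_fields = [
    tree.children_left,
    tree.children_right,
    tree.feature,
    tree.threshold,
    tree.impurity,
    tree.n_node_samples,
    tree.weighted_n_node_samples,
]
array_bytes = sum(field.nbytes for field in node_fields) + tree.value.nbytes
pickle_bytes = len(pickle.dumps(estimator))

print(f"nodes:              {tree.node_count}")
print(f"bytes in arrays:    {array_bytes}")
print(f"bytes in pickle:    {pickle_bytes}")
```

If the array total is close to the pickle size, the pickle is not inflated at all; it is the in-memory estimate that is too small. With `min_samples_split=5` the number of nodes grows with the number of samples, which would be consistent with the pickle growing roughly tenfold between 100K and 1M samples.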

Reducing output size

Depending on your requirements (Python version, compatibility, time), you may be able to reduce the output size by passing a more space-efficient protocol to pickle instead of relying on the default one.
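As a rough illustration of the protocol choice, the sketch below pickles the same object with every protocol the running interpreter supports and prints the resulting sizes. The `payload` list of random floats is a hypothetical stand-in for model data. For objects made mostly of NumPy arrays, the gain over the default protocol may be modest.

```python
import pickle
import random

random.seed(0)  # fixed seed so the payload is reproducible
payload = [random.random() for _ in range(10_000)]  # hypothetical stand-in data

# Serialize the same object once per supported protocol and record the size.
sizes = {
    protocol: len(pickle.dumps(payload, protocol=protocol))
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
}

for protocol, size in sizes.items():
    print(f"protocol {protocol}: {size} bytes")

# The object should survive a round trip with the most recent protocol.
restored = pickle.loads(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
```

Keep in mind that a pickle written with a newer protocol cannot be read by Python versions that predate that protocol.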

There is also a built-in module called pickletools that can be used to reduce the space used by your pickled file (pickletools.optimize). pickletools can also be used to disassemble the pickle's byte code.
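A minimal sketch of both pickletools features follows. The `payload` dictionary is made up for the example; how much `pickletools.optimize` saves depends on how many unused memo entries the pickle contains, so the saving on a real estimator may be small.

```python
import io
import pickle
import pickletools

# Hypothetical data: many distinct strings, each of which gets a memo entry.
payload = {"feature_names": [f"feature_{i}" for i in range(1_000)]}

raw = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
optimized = pickletools.optimize(raw)  # drops memo opcodes that are never referenced

print(f"raw: {len(raw)} bytes, optimized: {len(optimized)} bytes")

# Disassemble a tiny pickle into a text buffer to inspect its opcodes.
listing = io.StringIO()
pickletools.dis(pickle.dumps([1, 2, 3]), out=listing)
print(listing.getvalue())
```

The optimized bytes are still a regular pickle and load with `pickle.loads` as usual.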

Furthermore, you may compress the pickled output using built-in archiving modules.
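For instance, the standard library's gzip, bz2 and lzma modules can each compress the pickled bytes. The sketch below uses a deliberately repetitive, made-up `payload`; real model data such as floating-point thresholds may compress far less well, so treat the ratios as illustrative only.

```python
import bz2
import gzip
import lzma
import pickle

# Hypothetical, highly repetitive data so the effect of compression is visible.
payload = {"thresholds": [float(i % 100) for i in range(50_000)]}
raw = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

# Compress the same pickled bytes with three standard-library codecs.
compressed = {
    "gzip": gzip.compress(raw),
    "bz2": bz2.compress(raw),
    "lzma": lzma.compress(raw),
}

print(f"raw: {len(raw)} bytes")
for name, blob in compressed.items():
    print(f"{name}: {len(blob)} bytes")

# Decompress first, then unpickle, to get the original object back.
restored = pickle.loads(gzip.decompress(compressed["gzip"]))
```

The trade-off is extra CPU time when saving and loading, which may matter if the model is loaded often.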

References

https://github.com/pympler/pympler/blob/master/pympler/asizeof.py

https://docs.python.org/3/library/pickle.html

https://docs.python.org/3/library/pickletools.html#module-pickletools

https://docs.python.org/3/library/archiving.html

