Why is cross_val_predict so much slower than fit for KNeighborsClassifier?

2024/9/28 3:21:27

Running locally in a Jupyter notebook and using the MNIST dataset (28k entries, 28x28 pixels per image), the following takes 27 seconds:

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_jobs=1)
knn_clf.fit(pixels, labels)

However, the following takes 1722 seconds, in other words ~64 times longer:

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(knn_clf, pixels, labels, cv = 3, n_jobs=1)

My naive understanding is that cross_val_predict with cv=3 is doing 3-fold cross validation, so I'd expect it to fit the model 3 times, and so take at least ~3 times longer, but I don't see why it would take 64x!

To check whether it was something specific to my environment, I ran the same code in a Colab notebook. The difference was less extreme (15x), but still way above the ~3x I expected.

What am I missing? Why is cross_val_predict so much slower than just fitting the model?

In case it matters, I'm running scikit-learn 0.20.2.

Answer

KNN is often called a lazy algorithm because fitting does little more than store the input data (depending on the algorithm setting, it may also build a search index such as a KD-tree); no actual learning takes place.

The actual distance calculations happen during predict, for each test data point. With cross_val_predict, KNN has to predict on the validation data points of every fold, so across the 3 folds it predicts once on every sample in the dataset, and that is what drives the computation time up.
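To make the cost difference concrete, here is a minimal pure-Python sketch of a brute-force k-NN classifier that counts how many pairwise distances it computes. It is a toy illustration under my own assumptions, not scikit-learn's implementation: the names LazyKNN and count_cv_distance_evals are hypothetical, the data is synthetic, and the folds are split by index modulo rather than the way cross_val_predict splits them.

```python
from collections import Counter


class LazyKNN:
    """Toy brute-force k-NN classifier (hypothetical, for illustration only)."""

    def __init__(self, k=3):
        self.k = k
        self.distance_evals = 0  # number of pairwise distances computed so far

    def fit(self, X, y):
        # "Training" only stores the data; no distances are computed here.
        self.X_ = list(X)
        self.y_ = list(y)
        return self

    def predict(self, X):
        preds = []
        for q in X:
            dists = []
            # Every query point is compared against every stored training point.
            for p, label in zip(self.X_, self.y_):
                self.distance_evals += 1
                sq_dist = sum((a - b) ** 2 for a, b in zip(q, p))
                dists.append((sq_dist, label))
            dists.sort(key=lambda t: t[0])
            votes = Counter(label for _, label in dists[: self.k])
            preds.append(votes.most_common(1)[0][0])
        return preds


def count_cv_distance_evals(X, y, n_folds=3):
    """Run a simple k-fold loop and return the total number of distances computed."""
    n = len(X)
    total = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        model = LazyKNN().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        model.predict([X[i] for i in test_idx])
        total += model.distance_evals
    return total


# Synthetic data: 300 two-dimensional points with alternating labels.
X = [(float(i), float(i % 7)) for i in range(300)]
y = [i % 2 for i in range(300)]

fit_only = LazyKNN().fit(X, y).distance_evals
cv_evals = count_cv_distance_evals(X, y, n_folds=3)
print(fit_only, cv_evals)
```

In this sketch, fit alone computes no distances, while the 3-fold loop computes 100 x 200 distances per fold. The work in cross-validation therefore grows roughly with the square of the dataset size, which is why comparing it to the time of a single fit is misleading for a lazy learner.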
