Relation between 2D KDE bandwidth in sklearn vs bandwidth in scipy

2024/10/1 15:11:49

I'm attempting to compare the performance of sklearn.neighbors.KernelDensity versus scipy.stats.gaussian_kde for a two dimensional array.

From this article I see that the bandwidths (bw) are treated differently in each function. The article gives a recipe for setting the correct bw in scipy so it will be equivalent to the one used in sklearn . Basically it divides the bw by the sample standard deviation. The result is this:

# For sklearn
bw = 0.15# For scipy
bw = 0.15/x.std(ddof=1)

where x is the sample array I'm using to obtain the KDE. This works just fine in 1D, but I can't make it work in 2D.

Here's a MWE of what I got:

import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity# Generate random data.
n = 1000
m1, m2 = np.random.normal(0.2, 0.2, size=n), np.random.normal(0.2, 0.2, size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
kernel = stats.gaussian_kde(values, bw_method=0.15/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1,y1))
print 'iso1 = ', iso[0]# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=0.15).fit(zip(*values))
# Get KDE value for the point.
iso2 = kernel_sk.score_samples([[x1, y1]])
print 'iso2 = ', np.exp(iso2[0])

( iso2 is presented as an exponential since sklearn returns the log values)

The results I get for iso1 and iso2 are different and I'm lost as to how should I affect the bandwidth (in either function) to make them equal (as they should).


Add

I was advised over at sklearn chat (by ep) that I should scale the values in (x,y) before calculating the kernel with scipy in order to obtain comparable results with sklearn.

So this is what I did:

# Scale values.
x_val_sca = np.asarray(values[0])/np.asarray(values).std(axis=1)[0]
y_val_sca = np.asarray(values[1])/np.asarray(values).std(axis=1)[1]
values = [x_val_sca, y_val_sca]
kernel = stats.gaussian_kde(values, bw_method=bw_value)

ie: I scaled both dimensions before getting the kernel with scipy while leaving the line that obtains the kernel in sklearn untouched.

This gave better results but there's still differences in the kernels obtained:

kernels

where the red dot is the (x1,y1) point in the code. So as can be seen, there are still differences in the shapes of the density estimates, albeit very small ones. Perhaps this is the best that can be achieved?

Answer

A couple of years later I tried this and think I got it to work with no re-scaling needed for the data. Bandwidth values do need some scaling though:

# For sklearn
bw = 0.15# For scipy
bw = 0.15/x.std(ddof=1)

The evaluation of both KDEs for the same point is not exactly equal. For example here's an evaluation for the (x1, y1) point:

iso1 =  0.00984751705005  # Scipy
iso2 =  0.00989788224787  # Sklearn

but I guess it's close enough.

Here's the MWE for the 2D case and the output which, as far as I can see, look almost exactly the same:

enter image description here

import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec# Generate random data.
n = 1000
m1, m2 = np.random.normal(-3., 3., size=n), np.random.normal(-3., 3., size=n)
# Define limits.
xmin, xmax = min(m1), max(m1)
ymin, ymax = min(m2), max(m2)
ext_range = [xmin, xmax, ymin, ymax]
# Format data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = np.vstack([x.ravel(), y.ravel()])
values = np.vstack([m1, m2])# Define some point to evaluate the KDEs.
x1, y1 = 0.5, 0.5
# Bandwidth value.
bw = 0.15# -------------------------------------------------------
# Perform a kernel density estimate on the data using scipy.
# **Bandwidth needs to be scaled to match Sklearn results**
kernel = stats.gaussian_kde(values, bw_method=bw/np.asarray(values).std(ddof=1))
# Get KDE value for the point.
iso1 = kernel((x1, y1))
print 'iso1 = ', iso1[0]# -------------------------------------------------------
# Perform a kernel density estimate on the data using sklearn.
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=bw).fit(zip(*values))
# Get KDE value for the point. Use exponential since sklearn returns the
# log values
iso2 = np.exp(kernel_sk.score_samples([[x1, y1]]))
print 'iso2 = ', iso2[0]# Plot
fig = plt.figure(figsize=(10, 10))
gs = gridspec.GridSpec(1, 2)# Scipy
plt.subplot(gs[0])
plt.title("Scipy", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos = kernel(positions)
kde = np.reshape(k_pos.T, x.shape)
plt.imshow(np.rot90(kde), cmap=plt.cm.YlOrBr, extent=ext_range)
plt.contour(x, y, kde, 5, colors='k', linewidths=0.6)# Sklearn
plt.subplot(gs[1])
plt.title("Sklearn", x=0.5, y=0.92, fontsize=10)
# Evaluate kernel in grid positions.
k_pos2 = np.exp(kernel_sk.score_samples(zip(*positions)))
kde2 = np.reshape(k_pos2.T, x.shape)
plt.imshow(np.rot90(kde2), cmap=plt.cm.YlOrBr, extent=ext_range)
plt.contour(x, y, kde2, 5, colors='k', linewidths=0.6)fig.tight_layout()
plt.savefig('KDEs', dpi=300, bbox_inches='tight')
https://en.xdnf.cn/q/70957.html

Related Q&A

How to style (rich text) in QListWidgetItem and QCombobox items? (PyQt/PySide)

I have found similar questions being asked, but without answers or where the answer is an alternative solution.I need to create a breadcrumb trail in both QComboBoxes and QListWidgets (in PySide), and …

Reconnecting to device with pySerial

I am currently having a problem with the pySerial module in Python. My problem relates to connecting and disconnecting to a device. I can successfully connect to my device and communicate with it for a…

How to check if a sentence is a question with spacy?

I am using spacy library to build a chat bot. How do I check if a document is a question with a certain confidence? I know how to do relevance, but not sure how to filter statements from questions.I a…

Python: Lifetime of module-global variables

I have a shared resource with high initialisation cost and thus I want to access it across the system (its used for some instrumentation basically, so has to be light weight). So I created a module man…

Power spectrum with Cython

I am trying to optimize my code with Cython. It is doing a a power spectrum, not using FFT, because this is what we were told to do in class. Ive tried to write to code in Cython, but do not see any di…

Detect abbreviations in the text in python

I want to find abbreviations in the text and remove it. What I am currently doing is identifying consecutive capital letters and remove them.But I see that it does not remove abbreviations such as MOOC…

filtering of tweets received from statuses/filter (streaming API)

I have N different keywords that i am tracking (for sake of simplicity, let N=3). So in GET statuses/filter, I will give 3 keywords in the "track" argument.Now the tweets that i will be recei…

UndefinedError: current_user is undefined

I have a app with flask which works before But Now I use Blueprint in it and try to run it but got the error so i wonder that is the problem Blueprint that g.user Not working? and how can I fix it Thn…

Scipy filter with multi-dimensional (or non-scalar) output

Is there a filter similar to ndimages generic_filter that supports vector output? I did not manage to make scipy.ndimage.filters.generic_filter return more than a scalar. Uncomment the line in the cod…

How do I stop execution inside exec command in Python 3?

I have a following code:code = """ print("foo")if True: returnprint("bar") """exec(code) print(This should still be executed)If I run it I get:Tracebac…