I want to understand scipy.cluster.vq.kmeans.
Given a number of points distributed in 2D space, the problem is to group them into clusters. This problem came to my attention while reading this question, and I was thinking that scipy.cluster.vq.kmeans would be the way to go.
This is the data:
Using the following code, the aim would be to get the center point of each of the 25 clusters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import vq, kmeans, whiten

pos = np.arange(0,20,4)
scale = 0.4
size = 50
x = np.array([np.random.normal(i,scale,size*len(pos)) for i in pos]).flatten()
y = np.array([np.array([np.random.normal(i,scale,size) for i in pos]) for j in pos]).flatten()

plt.scatter(x,y, s=16, alpha=0.4)

#perform clustering with scipy.cluster.vq.kmeans
features = np.c_[x,y]

# take raw data to cluster
clusters = kmeans(features,25)
p = clusters[0]
plt.scatter(p[:,0],p[:,1], s=81, c="crimson")

# perform whitening (normalization to std) first
whitened = whiten(features)
clustersw = kmeans(whitened,25)
q = clustersw[0]*features.std(axis=0)
plt.scatter(q[:,0],q[:,1], s=25, c="gold")

plt.show()
The result looks like this:
The red dots mark the cluster centers found without whitening; the yellow dots mark those found with whitening. While the two sets differ, the main problem is that neither places all centers at the correct positions. Because the clusters are all well separated, I'm having trouble understanding why this simple clustering fails.
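To put a number on the mismatch instead of judging it by eye, something like the following sketch could be used. It rebuilds the same 5x5 layout as the code above (in a slightly different but equivalent way) and checks, for every true grid position, how far away the nearest returned centroid is. The fixed seed, the name `true_centers`, and the `3 * scale` threshold are my own assumptions for illustration, not part of the original script.

```python
import numpy as np
from scipy.cluster.vq import kmeans

rng = np.random.default_rng(0)  # assumption: fixed seed, only to make the data reproducible

pos = np.arange(0, 20, 4)
scale = 0.4
size = 50

# 50 points around every (i, j) grid position, 25 groups in total.
true_centers = np.array([(i, j) for i in pos for j in pos], dtype=float)
features = np.vstack([rng.normal(c, scale, size=(size, 2)) for c in true_centers])

# kmeans returns the codebook (centroids) and the mean distortion.
centroids, distortion = kmeans(features, 25)

# Distance from every true center to its nearest returned centroid.
diff = true_centers[:, None, :] - centroids[None, :, :]
nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

# Assumption: a true center counts as "missed" if no centroid lies within 3*scale.
missed = int((nearest > 3 * scale).sum())
print("centroids returned:", len(centroids))
print("true centers with no centroid within 3*scale:", missed)
print("mean distortion reported by kmeans:", distortion)
```

A nonzero count of missed centers would correspond to the situation in the plot, where some groups get no centroid of their own while others are split between two.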
I read this question, which reports kmeans not giving accurate results, but the answer is not really satisfactory. The suggested solution of using kmeans2 with minit='points' did not work either; i.e. kmeans2(features,25, minit='points') gives a result similar to the above.
So the question is: is there a way to solve this easy clustering problem with scipy.cluster.vq.kmeans? And if so, how would I make sure to get the correct result?