Question 1

I want to understand scipy.cluster.vq.kmeans.

Given a number of points distributed in 2D space, the problem is to group them into clusters. This problem came to my attention while reading this question, and I was thinking that scipy.cluster.vq.kmeans would be the way to go.

This is the data:
[image: scatter plot of the input data]

Using the following code, the aim would be to get the center point of each of the 25 clusters.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import vq, kmeans, whiten

pos = np.arange(0,20,4)
scale = 0.4
size = 50
x = np.array([np.random.normal(i,scale,size*len(pos)) for i in pos]).flatten()
y = np.array([np.array([np.random.normal(i,scale,size) for i in pos]) for j in pos]).flatten()

plt.scatter(x,y, s=16, alpha=0.4)

# perform clustering with scipy.cluster.vq.kmeans
features = np.c_[x,y]

# take raw data to cluster
clusters = kmeans(features,25)
p = clusters[0]
plt.scatter(p[:,0],p[:,1], s=81, c="crimson")

# perform whitening (normalization to std) first
whitened = whiten(features)
clustersw = kmeans(whitened,25)
q = clustersw[0]*features.std(axis=0)
plt.scatter(q[:,0],q[:,1], s=25, c="gold")

plt.show()

The result looks like this:
[image: scatter plot of the data with the computed cluster centers overlaid]

The red dots mark the cluster centers found without whitening; the yellow dots mark those found with whitening. While the two sets differ, the main problem is that they are obviously not all at the correct positions. Because the clusters are all well separated, I have trouble understanding why this simple clustering fails.

I read this question, which reports that kmeans does not give accurate results, but the answer is not really satisfactory. The suggested solution of using kmeans2 with minit='points' did not work either; i.e. kmeans2(features,25, minit='points') gives a result similar to the one above.

So the question is: is there a way to solve this easy clustering problem with scipy.cluster.vq.kmeans? And if so, how would I make sure to get the correct result?

Answer

On data like this, whitening does not make a difference: your x and y axes were already similarly scaled.
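
As a rough check of this point, the sketch below rebuilds data of the same shape as in the question and compares the per-axis standard deviations. The seed, the `grid` helper array, and the variable names are assumptions made for illustration; they are not from the original post.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Assumed stand-in for the question's data: a 5x5 grid of blobs, 50 points each.
rng = np.random.default_rng(0)
pos = np.arange(0, 20, 4)
grid = np.array([(i, j) for i in pos for j in pos], dtype=float)
features = np.concatenate([rng.normal(c, 0.4, size=(50, 2)) for c in grid])

# whiten() divides each column by its standard deviation.
col_std = features.std(axis=0)
whitened = whiten(features)
same_as_manual = np.allclose(whitened, features / col_std)

# If the two columns have nearly the same spread, whitening is close to a
# uniform rescaling and should not change which points group together.
std_ratio = col_std[0] / col_std[1]
print("per-axis std:", col_std, "ratio:", std_ratio)
```

When the ratio is close to 1, as it should be for this symmetric grid, whitening only shrinks both axes by almost the same factor, so it cannot be the cause of the misplaced centers.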

K-means does not reliably find the global optimum. Depending on the random initial centers, it tends to get stuck in local optima. That is why it is common to use multiple runs and keep only the best fit, and to experiment with more elaborate initialization procedures such as k-means++.
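
A minimal sketch of both suggestions, assuming a SciPy version in which kmeans2 accepts minit="++" (k-means++ seeding): run the clustering several times and keep the centers with the lowest mean distortion. The helper name `best_of_n_kmeans`, the number of runs, and the seeded test data are hypothetical choices for illustration, not part of the original question or answer.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def best_of_n_kmeans(features, k, n_runs=20):
    """Hypothetical helper: repeat k-means with k-means++ seeding and
    keep the run whose centers give the lowest mean distortion."""
    best_centers, best_distortion = None, np.inf
    for _ in range(n_runs):
        centers, _labels = kmeans2(features, k, minit="++")
        # vq() returns, per point, the index of and distance to the nearest center.
        _codes, dists = vq(features, centers)
        distortion = dists.mean()
        if distortion < best_distortion:
            best_centers, best_distortion = centers, distortion
    return best_centers, best_distortion

# Assumed stand-in for the question's data: a 5x5 grid of blobs, 50 points each.
rng = np.random.default_rng(0)
pos = np.arange(0, 20, 4)
grid = np.array([(i, j) for i in pos for j in pos], dtype=float)
features = np.concatenate([rng.normal(c, 0.4, size=(50, 2)) for c in grid])

centers, distortion = best_of_n_kmeans(features, 25)
print("mean distortion of best run:", distortion)
```

The returned `centers` can be plotted the same way as `p` in the question's code. Restarts and better seeding make a good fit much more likely, but they still do not guarantee the global optimum.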

Using scipy kmeans for cluster analysis
