Using scipy kmeans for cluster analysis

2024/10/11 22:27:28

I want to understand scipy.cluster.vq.kmeans.

Given a number of points distributed in 2D space, the problem is to group them into clusters. This problem came to my attention while reading this question, and I thought that scipy.cluster.vq.kmeans would be the way to go.

This is the data:
[Image: scatter plot of the 2D data points]

Using the following code, the aim would be to get the center point of each of the 25 clusters.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import vq, kmeans, whiten

pos = np.arange(0,20,4)
scale = 0.4
size = 50
x = np.array([np.random.normal(i,scale,size*len(pos)) for i in pos]).flatten()
y = np.array([np.array([np.random.normal(i,scale,size) for i in pos]) for j in pos]).flatten()

plt.scatter(x,y, s=16, alpha=0.4)

#perform clustering with scipy.cluster.vq.kmeans
features = np.c_[x,y]

# take raw data to cluster
clusters = kmeans(features,25)
p = clusters[0]
plt.scatter(p[:,0],p[:,1], s=81, c="crimson")

# perform whitening (normalization to std) first
whitened = whiten(features)
clustersw = kmeans(whitened,25)
q = clustersw[0]*features.std(axis=0)
plt.scatter(q[:,0],q[:,1], s=25, c="gold")

plt.show()

The result looks like this:
[Image: the scatter plot with the computed cluster centers overlaid]

The red dots mark the cluster centers found without whitening, the yellow dots those found with whitening. While the two sets differ, the main problem is that they are obviously not all in the correct position. Because the clusters are all well separated, I'm having trouble understanding why this simple clustering fails.

I read this question, which reports kmeans not giving accurate results, but the answer is not really satisfactory. The suggested solution of using kmeans2 with minit='points' did not work either; i.e. kmeans2(features,25, minit='points') gives a result similar to the above.

So the question is: is there a way to solve this easy clustering problem with scipy.cluster.vq.kmeans? And if so, how would I make sure to get the correct result?

Answer

On data like this, whitening does not make a difference: your x and y axes were already similarly scaled.
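A quick way to check this claim is to compare the per-axis standard deviations, since whiten divides each column by its standard deviation. The snippet below is only a sketch: it rebuilds an equivalent 5x5 grid of blobs with np.meshgrid rather than reusing the question's exact construction, and the seed is an arbitrary choice.

```python
import numpy as np
from scipy.cluster.vq import whiten

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
pos = np.arange(0, 20, 4)
scale, size = 0.4, 50

# 25 blob centers on a 5x5 grid, 50 noisy points around each
# (an assumed stand-in for the question's data, same parameters)
cx, cy = np.meshgrid(pos, pos)
centers = np.c_[cx.ravel(), cy.ravel()].astype(float)
features = np.repeat(centers, size, axis=0) + rng.normal(0, scale, (len(centers) * size, 2))

# whiten() divides each column by its standard deviation
std = features.std(axis=0)
ratio = std[0] / std[1]
whitened = whiten(features)

print("per-axis std:", std)
print("ratio of x std to y std:", ratio)
```

If the ratio is close to 1, whitening rescales both axes by nearly the same factor, so the geometry that k-means sees is essentially unchanged.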

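K-means does not reliably find the global optimum. It tends to get stuck in local optima. With 25 clusters, a random initialization rarely places exactly one starting center in each cluster, so some clusters end up sharing a center while others are split. That is why it is common to use multiple runs and keep only the best fit, and to experiment with more careful initialization procedures like k-means++.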
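As a rough illustration of the "multiple runs, keep the best" idea combined with k-means++ seeding, here is a minimal sketch. It uses kmeans2 with minit="++" (this needs a reasonably recent SciPy) and scores each run by the sum of squared distances to the nearest centroid. The helper name best_of_n_kmeans, the number of runs, and the regenerated test data are my own assumptions, not part of the question. Note that the iter argument of kmeans2 is the number of k-means iterations within one run, whereas the iter argument of kmeans is the number of runs.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

rng = np.random.default_rng(0)  # arbitrary seed for the synthetic data
pos = np.arange(0, 20, 4)
scale, size = 0.4, 50

# Assumed stand-in for the question's data: 25 blobs on a 5x5 grid
cx, cy = np.meshgrid(pos, pos)
true_centers = np.c_[cx.ravel(), cy.ravel()].astype(float)
features = np.repeat(true_centers, size, axis=0) + rng.normal(
    0, scale, (len(true_centers) * size, 2)
)

def best_of_n_kmeans(data, k, n_runs=50):
    """Hypothetical helper: run kmeans2 n_runs times, keep the lowest-inertia fit."""
    best_centroids, best_inertia = None, np.inf
    inertias = []
    for _ in range(n_runs):
        # k-means++ seeding, then 20 k-means iterations for this run
        centroids, _labels = kmeans2(data, k, iter=20, minit="++")
        # vq returns the Euclidean distance of each point to its nearest centroid
        _, dist = vq(data, centroids)
        inertia = float((dist ** 2).sum())
        inertias.append(inertia)
        if inertia < best_inertia:
            best_centroids, best_inertia = centroids, inertia
    return best_centroids, best_inertia, inertias

best_centroids, best_inertia, inertias = best_of_n_kmeans(features, 25)
print("best inertia:", best_inertia)
print("worst inertia:", max(inertias))
```

The spread between the best and worst inertia gives a feel for how much individual runs differ. Restarts raise the chance of a good fit but do not guarantee it, so it is still worth plotting best_centroids over the data to check.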
