Question 1

I'm having trouble computing the silhouette coefficient in Python with sklearn. Here is my code:

import pandas as pd
from sklearn import datasets
from sklearn.metrics import *
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns = col)  # col: list of column names, not shown here
y = pd.DataFrame(iris.target,columns = ['cluster'])
s = silhouette_score(X, y, metric='euclidean',sample_size=50)

I get the error:

IndexError: indices are out-of-bounds

I want to use the sample_size parameter because, on very large datasets, the silhouette score takes too long to compute. Does anyone know how this parameter works?

Complete traceback:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-72-70ff40842503> in <module>()
      4 X = pd.DataFrame(iris.data, columns = col)
      5 y = pd.DataFrame(iris.target,columns = ['cluster'])
----> 6 s = silhouette_score(X, y, metric='euclidean',sample_size=50)

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
     81             X, labels = X[indices].T[indices].T, labels[indices]
     82         else:
---> 83             X, labels = X[indices], labels[indices]
     84     return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
     85 

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1993         if isinstance(key, (np.ndarray, list)):
   1994             # either boolean or fancy integer index
-> 1995             return self._getitem_array(key)
   1996         elif isinstance(key, DataFrame):
   1997             return self._getitem_frame(key)

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_array(self, key)
   2030         else:
   2031             indexer = self.ix._convert_to_indexer(key, axis=1)
-> 2032             return self.take(indexer, axis=1, convert=True)
   2033 
   2034     def _getitem_multilevel(self, key):

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in take(self, indices, axis, convert)
   2981         if convert:
   2982             axis = self._get_axis_number(axis)
-> 2983             indices = _maybe_convert_indices(indices, len(self._get_axis(axis)))
   2984 
   2985         if self._is_mixed_type:

/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _maybe_convert_indices(indices, n)
   1038     mask = (indices>=n) | (indices<0)
   1039     if mask.any():
-> 1040         raise IndexError("indices are out-of-bounds")
   1041     return indices
   1042 

IndexError: indices are out-of-bounds

Answer

silhouette_score expects regular NumPy arrays as input. Why wrap your arrays in DataFrames?

>>> silhouette_score(iris.data, iris.target, sample_size=50)
0.52999903616584543

From the traceback, you can see that silhouette_score subsamples with fancy indexing on the first axis (X[indices]). By default, indexing a DataFrame with [] selects columns, not rows, so the sampled row indices are treated as column positions and fall outside the frame's 4 columns, hence the IndexError.
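If you would rather keep the DataFrames around, one option is to hand scikit-learn the underlying arrays. This is a minimal sketch, not the only fix: iris.feature_names stands in for the col variable that the question does not show, and random_state=0 is an arbitrary choice added only to make the subsample repeatable.

```python
import pandas as pd
from sklearn import datasets
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()
# Assumption: iris.feature_names replaces the undefined `col` from the question.
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target, columns=['cluster'])

# X[[0, 75, 149]] would be interpreted as a *column* selection and fail
# (the exact exception depends on the pandas version).
# .iloc selects rows by position instead.
rows = X.iloc[[0, 75, 149]]
print(rows.shape)

# Pass plain NumPy arrays: a 2-D feature array and a 1-D label array.
score = silhouette_score(
    X.values,
    y['cluster'].values,
    metric='euclidean',
    sample_size=50,   # score a random subset of 50 samples
    random_state=0,   # arbitrary seed, only for repeatability
)
print(score)
```

With sample_size set, the score is computed on a random subset of the rows rather than on the full dataset, so it is an estimate and varies with random_state.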
