Silhouette coefficient in Python with sklearn

2024/11/16 19:48:58

I'm having trouble computing the silhouette coefficient in Python with sklearn. Here is my code:

import pandas as pd
from sklearn import datasets
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()
col = iris.feature_names  # assumption: `col` was not defined in the original post
X = pd.DataFrame(iris.data, columns=col)
y = pd.DataFrame(iris.target, columns=['cluster'])
s = silhouette_score(X, y, metric='euclidean', sample_size=50)

I get the error:

IndexError: indices are out-of-bounds

I want to use the sample_size parameter because the silhouette coefficient takes too long to compute on very large datasets. Does anyone know how this parameter is supposed to work?

Complete traceback:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-72-70ff40842503> in <module>()
      4 X = pd.DataFrame(iris.data, columns = col)
      5 y = pd.DataFrame(iris.target,columns = ['cluster'])
----> 6 s = silhouette_score(X, y, metric='euclidean',sample_size=50)

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
     81             X, labels = X[indices].T[indices].T, labels[indices]
     82         else:
---> 83             X, labels = X[indices], labels[indices]
     84     return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
     85 

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1993         if isinstance(key, (np.ndarray, list)):
   1994             # either boolean or fancy integer index
-> 1995             return self._getitem_array(key)
   1996         elif isinstance(key, DataFrame):
   1997             return self._getitem_frame(key)

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_array(self, key)
   2030         else:
   2031             indexer = self.ix._convert_to_indexer(key, axis=1)
-> 2032             return self.take(indexer, axis=1, convert=True)
   2033 
   2034     def _getitem_multilevel(self, key):

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in take(self, indices, axis, convert)
   2981         if convert:
   2982             axis = self._get_axis_number(axis)
-> 2983             indices = _maybe_convert_indices(indices, len(self._get_axis(axis)))
   2984 
   2985         if self._is_mixed_type:

/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _maybe_convert_indices(indices, n)
   1038     mask = (indices>=n) | (indices<0)
   1039     if mask.any():
-> 1040         raise IndexError("indices are out-of-bounds")
   1041     return indices
   1042 

IndexError: indices are out-of-bounds

Answer

silhouette_score expects regular numpy arrays as input. Why wrap your arrays in data frames?

>>> silhouette_score(iris.data, iris.target, sample_size=50)
0.52999903616584543
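
If the data already lives in data frames, one option is to hand silhouette_score the underlying arrays instead. The following is a minimal sketch, not the only way to do it: it assumes a pandas version that provides to_numpy(), uses iris.feature_names as a stand-in for the undefined col, and fixes random_state only so the subsample is repeatable.

```python
import pandas as pd
from sklearn import datasets
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()

# Data frames as in the question; the column names are an assumption,
# since `col` was never defined in the original post.
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target, columns=["cluster"])

# Extract plain arrays: a 2-D feature matrix and a 1-D label vector.
X_arr = X.to_numpy()
labels = y["cluster"].to_numpy()

# sample_size=50 scores a random subset of 50 rows instead of all 150.
score = silhouette_score(
    X_arr, labels, metric="euclidean", sample_size=50, random_state=0
)
print(score)
```

The score changes with the rows that get sampled, so without a fixed random_state repeated calls will generally return slightly different values.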

From the traceback, you can see that silhouette_score subsamples with fancy indexing on the first axis (X[indices]). On an array this selects rows, but indexing a DataFrame with [] selects columns by default. The sampled indices go up to 149 while the frame has only 4 columns, hence the out-of-bounds error.
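
The difference between the two indexing behaviours can be reproduced without sklearn. This is an illustrative sketch on made-up data: the column names a to d and the three indices are hypothetical, and the exception type depends on the pandas version (the traceback in the question shows IndexError, while newer releases raise KeyError for labels that are not in the columns).

```python
import numpy as np
import pandas as pd

# Toy data with the same shape as iris: 150 rows, 4 columns.
data = np.arange(150 * 4, dtype=float).reshape(150, 4)
df = pd.DataFrame(data, columns=["a", "b", "c", "d"])  # hypothetical names

# A few row positions, like the ones a random subsample would produce.
indices = np.array([3, 42, 149])

# On a NumPy array, an integer array indexes the first axis (rows).
rows_from_array = data[indices]

# On a DataFrame, [] looks the values up among the columns, so it fails.
try:
    df[indices]
except (KeyError, IndexError) as exc:
    frame_error = type(exc).__name__
else:
    frame_error = None

# Positional row selection on a DataFrame has to go through .iloc.
rows_from_frame = df.iloc[indices]

print(rows_from_array.shape, rows_from_frame.shape, frame_error)
```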
