I am using sklearn for text classification, all my features are numerical but my target variable labels are in text. I can understand the rationale behind encoding features to numerics but don't think this applies for the target variable?
If your target variable is textual, you can transform it into numeric form (or you can leave it alone, please see my note below) so that any Scikit-learn algorithm can pick it up in an OVA (One Versus All) scheme: your learning algorithm tries to guess each class as compared against all the remaining ones, and it expects the classes to be expressed as numeric codes running from 0 to (number of classes - 1).
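As a minimal sketch of that transformation, Scikit-learn's LabelEncoder maps text labels to exactly such codes (the label values here are just the iris class names, used for illustration):

```python
from sklearn.preprocessing import LabelEncoder

# Text labels for the target variable (iris class names, for illustration)
y_text = ["setosa", "versicolor", "virginica", "setosa"]

le = LabelEncoder()
y_num = le.fit_transform(y_text)

print(y_num)                        # numeric codes from 0 to n_classes - 1
print(le.classes_)                  # the original label behind each code
print(le.inverse_transform(y_num))  # codes map back to the text labels
```

Note that `inverse_transform` lets you recover the original text labels from predictions made in numeric form.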
For instance, in this example from the Scikit-learn documentation, you can figure out the class of your iris because three models are fitted, one evaluating each possible class.
Naturally, classes 0, 1 and 2 are Setosa, Versicolor, and Virginica, but the algorithm needs them expressed as numeric codes, as you can verify by exploring the results of the example code:
list(iris.target_names)
['setosa', 'versicolor', 'virginica']

np.unique(Y)
array([0, 1, 2])
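The one-versus-all setup described above can be sketched with OneVsRestClassifier, which fits one binary model per class (a minimal example; LogisticRegression is just one possible base estimator):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary classifier is trained per class (class k versus the rest)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # 3 fitted models, one per iris class
print(ovr.classes_)          # array([0, 1, 2])
```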
NOTE: it is true that Scikit-learn encodes the target labels by itself if they are strings. On Scikit-learn's GitHub page for logistic regression (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py) you can see at rows 1623 and 1624 where the code calls the label encoder and encodes the labels automatically:
# Encode for string labels
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
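In practice this means you can pass the string labels directly to the estimator and get string predictions back. A short sketch, assuming the standard iris dataset with its numeric targets mapped back to their text names:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
# Replace the numeric targets with their text names to simulate string labels
y_text = iris.target_names[iris.target]

# The estimator encodes the string labels internally; no manual step needed
clf = LogisticRegression(max_iter=1000).fit(X, y_text)

print(clf.classes_)        # the string labels, in sorted order
print(clf.predict(X[:1]))  # predictions come back as strings too
```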