Problems with binary one-hot (one-of-K) coding in Python

2024/9/30 9:29:04

Binary one-hot (also known as one-of-K) coding consists of making one binary column for each distinct value of a categorical variable. For example, if one has a color column (categorical variable) that takes the values 'red', 'blue', 'yellow', and 'unknown', then a binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', and 'color=yellow'. I begin with data in a pandas data-frame and I want to use this data to train a model with scikit-learn. I know two ways to do the binary one-hot coding, neither of them satisfactory to me.

  1. Pandas and get_dummies on the categorical columns of the data-frame. This method works very well as long as the original data-frame contains all the available data, that is, you do the one-hot coding before splitting your data into training, validation, and test sets. However, if the data is already split into different sets, this method doesn't work very well. Why? Because one of the data sets (say, the test set) can contain fewer distinct values for a given variable. For example, it can happen that whereas the training set contains the values red, blue, yellow, and unknown for the variable color, the test set only contains red and blue. So the test set would end up with fewer columns than the training set. (I also don't know how the new columns are sorted, or whether, even with the same columns, they could come out in a different order in each set.)

  2. Sklearn and DictVectorizer. This solves the previous issue, as we can make sure that we apply the very same transformation to the test set. However, the outcome of the transformation is a numpy array instead of a pandas data-frame. To recover the output as a pandas data-frame, we need to (or at least this is the way I do it): 1) build pandas.DataFrame(data=<outcome of the DictVectorizer transformation>, index=<index of the original pandas data-frame>, columns=<feature names of the fitted DictVectorizer, from get_feature_names(), which is get_feature_names_out() in recent scikit-learn versions>) and 2) join the resulting data-frame along the index with the original one containing the numerical columns. This works, but it is somewhat cumbersome.

Is there a better way to do a binary one-hot encoding within a pandas data-frame if we have our data split into training and test sets?

Answer

If your columns are in the same order, you can concatenate the data frames, use get_dummies, and then split them back again, e.g.,

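import pandas as pd

# Encode train and test together so both get the same dummy columns.
encoded = pd.get_dummies(pd.concat([train, test], axis=0))
train_rows = train.shape[0]
# Split back by position: the first train_rows rows came from train.
train_encoded = encoded.iloc[:train_rows, :]
test_encoded = encoded.iloc[train_rows:, :]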

If your columns are not in the same order, then you'll have challenges regardless of what method you try.
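If concatenating is not convenient, a different approach (not the one described above) is to encode each set separately and then align the test columns to the training columns with `DataFrame.reindex`. The following is a minimal sketch on made-up toy data; the `color` and `size` columns and the `=` separator are assumptions chosen to match the question's example.

```python
import pandas as pd

# Toy data, invented for this illustration only.
train = pd.DataFrame({"color": ["red", "blue", "yellow", "unknown"], "size": [1, 2, 3, 4]})
test = pd.DataFrame({"color": ["red", "blue"], "size": [5, 6]})

# Encode each set on its own; dtype=int gives 0/1 integers on any pandas version.
train_encoded = pd.get_dummies(train, columns=["color"], prefix_sep="=", dtype=int)
test_encoded = pd.get_dummies(test, columns=["color"], prefix_sep="=", dtype=int)

# The test set lacks the dummy columns for values it never contained.
missing = sorted(set(train_encoded.columns) - set(test_encoded.columns))

# Align the test set to the training columns: same set, same order,
# with columns absent from the test set filled with 0.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
```

Note that `reindex` also drops any dummy column that exists only in the test set, so values unseen during training are discarded rather than raising an error.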

