LabelEncoder: How to keep a dictionary that shows original and converted variable

2024/10/13 5:21:16

When using LabelEncoder to encode categorical variables into numerics,

how does one keep a dictionary in which the transformation is tracked?

i.e. a dictionary in which I can see which values became what:

{'A':1,'B':2,'C':3}
Answer

I created a dictionary from classes_

le = preprocessing.LabelEncoder()
ids = le.fit_transform(labels)
mapping = dict(zip(le.classes_, range(len(le.classes_))))

to test:

all([mapping[x] for x in le.inverse_transform(ids)] == ids)

should return True.

This works because fit_transform uses numpy.unique to simultaneously calculate the label encoding and the classes_ attribute:

def fit_transform(self, y):self.classes_, y = np.unique(y, return_inverse=True)return y
https://en.xdnf.cn/q/69570.html

Related Q&A

How to find hidden files inside image files (Jpg/Gif/Png) [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, argum…

How to open a simple image using streams in Pillow-Python

from PIL import Imageimage = Image.open("image.jpg")file_path = io.BytesIO();image.save(file_path,JPEG);image2 = Image.open(file_path.getvalue());I get this error TypeError: embedded NUL char…

SyntaxError: Non-UTF-8 code starting with \x82 [duplicate]

This question already has answers here:"SyntaxError: Non-ASCII character ..." or "SyntaxError: Non-UTF-8 code starting with ..." trying to use non-ASCII text in a Python script(7 an…

How to identify the CPU core ID of a process on Python multiprocessing?

I am testing Pythons multiprocessing module on a cluster with SLURM. I want to make absolutely sure that each of my tasks are actually running on separate cpu cores as I intend. Due to the many possibi…

Finding highest values in each row in a data frame for python

Id like to find the highest values in each row and return the column header for the value in python. For example, Id like to find the top two in each row:df = A B C D 5 9 8 2 4 …

Using pytest_addoptions in a non-root conftest.py

I have a project that has the following structure: Project/ | +-- src/ | | | +-- proj/ | | | +-- __init__.py | +-- code.py | +-- tests/ | | | +-- __init_…

How to count distinct values in a combination of columns while grouping by in pandas?

I have a pandas data frame. I want to group it by using one combination of columns and count distinct values of another combination of columns.For example I have the following data frame:a b c …

Set Environmental Variables in Python with Popen

I want to set an environmental variable in linux terminal through a python script. I seem to be able to set environmental variables when using os.environ[BLASTDB] = /path/to/directory .However I was in…

Python - pandas - Append Series into Blank DataFrame

Say I have two pandas Series in python:import pandas as pd h = pd.Series([g,4,2,1,1]) g = pd.Series([1,6,5,4,"abc"])I can create a DataFrame with just h and then append g to it:df = pd.DataFr…

How to retrieve values from a function run in parallel processes?

The Multiprocessing module is quite confusing for python beginners specially for those who have just migrated from MATLAB and are made lazy with its parallel computing toolbox. I have the following fun…