Pandas One hot encoding: Bundling together less frequent categories

2024/10/18 15:42:41

I'm doing one hot encoding over a categorical column which has some 18 different kinds of values. I want to create new columns only for those values that appear more often than some threshold (let's say 1%), and create another column named other values, which is 1 if the value is anything other than those frequent values.

I'm using pandas with scikit-learn. I've explored pandas get_dummies and scikit-learn's OneHotEncoder, but can't figure out how to bundle the less frequent values together into one column.

Answer

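Plan

  • pd.get_dummies to one hot encode as normal
  • relative frequency < threshold to identify the columns that get aggregated
    • I use value_counts with the parameter normalize=True to get the percentage of occurrence.
  • join the kept columns with the row-wise sum of the aggregated columns, renamed to 'other'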

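import numpy as np
import pandas as pd

def hot_mess(s, thresh):
    # dtype=int keeps the 0/1 output shown below on newer pandas
    d = pd.get_dummies(s, dtype=int)
    # True for categories rarer than thresh
    f = s.value_counts(sort=False, normalize=True) < thresh
    if f.sum() == 0:
        return d
    return d.loc[:, ~f].join(d.loc[:, f].sum(1).rename('other'))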

Consider the pd.Series s

s = pd.Series(np.repeat(list('abcdef'), range(1, 7)))
s

0     a
1     b
2     b
3     c
4     c
5     c
6     d
7     d
8     d
9     d
10    e
11    e
12    e
13    e
14    e
15    f
16    f
17    f
18    f
19    f
20    f
dtype: object
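
Before picking a threshold, it can help to look at the relative frequencies directly. The following is only a small sketch on the same s; the names freq and rare are introduced here for illustration and are not part of the function above.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.repeat(list('abcdef'), range(1, 7)))

# Relative frequency of each category: a is 1/21, b is 2/21, ..., f is 6/21
freq = s.value_counts(normalize=True)

# Categories rarer than a 10% threshold; here a (1/21) and b (2/21) qualify
rare = sorted(freq[freq < 0.1].index)
print(rare)
```

With 21 rows in total, c already sits at 3/21 (about 14%), so a threshold of .1 would bundle only a and b.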

hot_mess(s, 0)

    a  b  c  d  e  f
0   1  0  0  0  0  0
1   0  1  0  0  0  0
2   0  1  0  0  0  0
3   0  0  1  0  0  0
4   0  0  1  0  0  0
5   0  0  1  0  0  0
6   0  0  0  1  0  0
7   0  0  0  1  0  0
8   0  0  0  1  0  0
9   0  0  0  1  0  0
10  0  0  0  0  1  0
11  0  0  0  0  1  0
12  0  0  0  0  1  0
13  0  0  0  0  1  0
14  0  0  0  0  1  0
15  0  0  0  0  0  1
16  0  0  0  0  0  1
17  0  0  0  0  0  1
18  0  0  0  0  0  1
19  0  0  0  0  0  1
20  0  0  0  0  0  1

hot_mess(s, .1)

    c  d  e  f  other
0   0  0  0  0      1
1   0  0  0  0      1
2   0  0  0  0      1
3   1  0  0  0      0
4   1  0  0  0      0
5   1  0  0  0      0
6   0  1  0  0      0
7   0  1  0  0      0
8   0  1  0  0      0
9   0  1  0  0      0
10  0  0  1  0      0
11  0  0  1  0      0
12  0  0  1  0      0
13  0  0  1  0      0
14  0  0  1  0      0
15  0  0  0  1      0
16  0  0  0  1      0
17  0  0  0  1      0
18  0  0  0  1      0
19  0  0  0  1      0
20  0  0  0  1      0
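
A different route to a similar result, sketched here as an alternative rather than as the method above, is to relabel the rare values first and then one hot encode once. The helper name bundle_rare and its other_label parameter are hypothetical, and the sketch assumes no existing category is already called 'other'.

```python
import numpy as np
import pandas as pd

def bundle_rare(s, thresh, other_label='other'):
    # Hypothetical helper: relabel rare values, then encode.
    # Map every row to the relative frequency of its own category
    freq = s.map(s.value_counts(normalize=True))
    # Keep values at or above the threshold, replace the rest with one label
    relabeled = s.where(freq >= thresh, other_label)
    return pd.get_dummies(relabeled, dtype=int)

s = pd.Series(np.repeat(list('abcdef'), range(1, 7)))
out = bundle_rare(s, .1)
print(out.columns.tolist())
```

One difference to keep in mind: get_dummies sorts the column labels, so the bundled column lands wherever other_label sorts among the remaining categories instead of always being appended last.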