applying onehotencoder on numpy array

2024/11/19 22:54:24

I am applying OneHotEncoder on numpy array.

Here's the code

print X.shape, test_data.shape #gives 4100, 15) (410, 15)
onehotencoder_1 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
X = onehotencoder_1.fit_transform(X).toarray()
onehotencoder_2 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
test_data = onehotencoder_2.fit_transform(test_data).toarray()print X.shape, test_data.shape #gives (4100, 46) (410, 43)

where both X and test_data are <type 'numpy.ndarray'>

X is my train set while test_data my test set.

How come the no. of columns different for X & test_data. they should be 46 or either 43 for both after applying onehotencoder.

I am applying OnehotEncoder on specific attributes as they are categorical in nature in both X and test_data

Can someone point out what is wrong here?

Answer

Don't use a new OneHotEncoder on test_data, use the first one, and only use transform() on it. Do this:

test_data = onehotencoder_1.transform(test_data).toarray()

Never use fit() (or fit_transform()) on testing data.

The different number of columns are entirely possible because it may happen that test data dont contain some categories which are present in train data. So when you use a new OneHotEncoder and call fit() (or fit_transform()) on it, it will only learn about categories present in test_data. So there will be difference between the columns.

https://en.xdnf.cn/q/119904.html

Related Q&A

How to delete temp folder data using python script [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.Want to improve this question? Update the question so it focuses on one problem only by editing this post.Closed 6…

Save a list of objects on exit of pygame game [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.Want to improve this question? Add details and clarify the problem by editing this post.Closed 10 years ago.Improv…

Trying to make loop for a function that stops after the result is lower than a certain value

Im taking a beginner python class and part of an exercise we were given was this:The point x with the property x= sin(x)−ax+ 30 is called a fixed point of the function f(x) = sin(x)−ax+ 30. It can b…

python url extract from html

I need python regex to extract urls from html, example html code :<a href=""http://a0c5e.site.it/r"" target=_blank><font color=#808080>MailUp</font></a> <…

Regex match each character at least once [closed]

Its difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying thi…

How to cluster with K-means, when number of clusters and their sizes are known [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.Want to improve this question? Update the question so it focuses on one problem only by editing this post.Closed 1…

Converting German characters (like , etc) from Mac Roman to UTF (or similar)?

I have a CSV file which I can read in and at all works fine except for the specific German (and possibly other) characters. Ive used chardet to determine that the encoding is Mac Roman import chardetde…

Caesar cipher without knowing the Key

Hey guys if you look at my code below you will be able to see that i was able to create a program that can open a file decode the content of the file and save it into another file but i need to input t…

how to convert u\uf04a to unicode in python [duplicate]

This question already has answers here:Python unicode codepoint to unicode character(4 answers)Closed 2 years ago.I am trying to decode u\uf04a in python thus I can print it without error warnings. In …

How can I display a nxn matrix depending on users input?

For a school task I need to display a nxn matrix depending on users input: heres an example: 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0(users input: 5) And here is my code until now:n = int(inpu…