Question 1

I am applying OneHotEncoder to a NumPy array.

Here's the code:

print X.shape, test_data.shape  # gives (4100, 15) (410, 15)
onehotencoder_1 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
X = onehotencoder_1.fit_transform(X).toarray()
onehotencoder_2 = OneHotEncoder(categorical_features = [0, 3, 4, 5, 6, 8, 9, 11, 12])
test_data = onehotencoder_2.fit_transform(test_data).toarray()
print X.shape, test_data.shape  # gives (4100, 46) (410, 43)

where both X and test_data are <type 'numpy.ndarray'>

X is my training set and test_data is my test set.

Why is the number of columns different for X and test_data? After applying OneHotEncoder, both should have the same number of columns, either 46 or 43.

I am applying OneHotEncoder only to specific attributes, since they are categorical in both X and test_data.

Can someone point out what is wrong here?

Answer

Don't fit a new OneHotEncoder on test_data. Reuse the first one and call only transform() on it:

test_data = onehotencoder_1.transform(test_data).toarray()

Never use fit() (or fit_transform()) on testing data.

A different number of columns is entirely possible, because the test data may not contain some categories that are present in the training data. When you create a new OneHotEncoder and call fit() (or fit_transform()) on test_data, it only learns the categories present in test_data, so the number of output columns differs from that of the training set.
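To make the column mismatch concrete, here is a minimal plain-Python sketch of the fit/transform split. It only mimics the idea and is not scikit-learn's implementation; the helper names fit_categories and transform and the toy data are hypothetical, chosen just for illustration.

```python
# Conceptual sketch only: mimics "fit learns categories, transform encodes
# with them" in plain Python. This is not scikit-learn's implementation.

def fit_categories(rows, columns):
    """Learn the sorted distinct values seen in each chosen column."""
    return {c: sorted({row[c] for row in rows}) for c in columns}

def transform(rows, categories):
    """One-hot encode the chosen columns using previously learned categories."""
    encoded = []
    for row in rows:
        out = []
        for c in sorted(categories):
            # One output column per category learned during fit.
            out.extend(1 if row[c] == value else 0 for value in categories[c])
        # Pass the remaining columns through unchanged.
        out.extend(v for i, v in enumerate(row) if i not in categories)
        encoded.append(out)
    return encoded

# Hypothetical toy data: column 0 is categorical, column 1 is numeric.
train = [["red", 1.0], ["green", 2.0], ["blue", 3.0]]
test = [["red", 4.0], ["blue", 5.0]]  # "green" never appears in the test set

learned_on_train = fit_categories(train, [0])
learned_on_test = fit_categories(test, [0])

train_encoded = transform(train, learned_on_train)
test_refit = transform(test, learned_on_test)    # the mistake: fitting on test data
test_reused = transform(test, learned_on_train)  # the fix: reuse the training fit

# Number of columns in each encoded set.
widths = (len(train_encoded[0]), len(test_refit[0]), len(test_reused[0]))
print(widths)
```

Fitting on the test rows learns only two categories, so test_refit ends up one column narrower than train_encoded. Reusing the categories learned from the training rows keeps the widths equal, which is what calling transform() on onehotencoder_1 achieves.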

