SKlearn prediction on test dataset with different shape from training dataset shape

2024/10/12 22:30:59

I'm new to ML and would be grateful for any assistance. I've run a linear regression prediction using test set A and training set A. I saved the linear regression model and would now like to use the same model to predict the test set A target using features from test set B. Each time I run the model, it throws the error below.

How can I successfully predict a test data set from features and a target with different shapes?

Input
print(testB.shape)
print(testA.shape)

Output
(2480, 5)
(1315, 6)

Input
saved_model = joblib.load(filename)
testB_result = saved_model.score(testB_features, testA_target)
print(testB_result)

Output
ValueError: Found input variables with inconsistent numbers of samples: [1315, 2480]

Thanks again

Answer

The shapes are inconsistent, which is why the error is thrown. Have you tried reshaping the data so that both have the same shape? From a quick look, testA has fewer samples (1315 vs. 2480) and one more column (6 vs. 5) than testB.

Think about it: if you have trained your model with 5 features, you cannot then ask the same model to make a prediction given 6 features. You mention using a linear regressor, whose equation is roughly:

y = b + w0*x0 + w1*x1 + w2*x2 + ... + w(N-1)*x(N-1)

where
  y is your output/label
  N is the number of features
  b is the bias term
  w(i) is the ith weight
  x(i) is the ith feature value

You have trained a linear regressor with 5 features, effectively producing the following:

y (your output/label) = b + w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4

You then ask it to make a prediction given 6 features but it only knows how to deal with 5.
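To see the feature-count constraint concretely, here is a minimal sketch. It uses synthetic data (not the asker's datasets), and the names X_train, y_train and X_six are made up for illustration. It fits a LinearRegression on 5 features and then shows that predict() rejects a 6-feature input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: 100 samples, 5 features.
X_train = rng.normal(size=(100, 5))
y_train = X_train @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + 0.5

model = LinearRegression().fit(X_train, y_train)

# The fitted model holds one weight per training feature plus one bias term.
n_weights = model.coef_.shape[0]
n_expected = model.n_features_in_
print(n_weights, n_expected)

# Six features instead of five: predict() refuses the input.
X_six = rng.normal(size=(10, 6))
try:
    model.predict(X_six)
    feature_mismatch_raised = False
except ValueError as err:
    feature_mismatch_raised = True
    print("ValueError:", err)
```

Checking model.n_features_in_ on the loaded model is a quick way to confirm how many feature columns it was trained on before passing it new data.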

Aside from that issue, the sample counts also differ: testB has 2480 samples and testA has 1315. These need to match, because the model makes 2480 predictions but you give it only 1315 targets to compare them against. How can it score the 1165 predictions that have no target? Do you now see why the data has to be reshaped?
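The sample-count mismatch can be reproduced with a small sketch. The data is synthetic and the names X_b, y_a and y_b are hypothetical stand-ins; only the row counts (2480 and 1315) come from the question. score() predicts one value per row of the features and then compares those predictions against the target, so both must have the same number of rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 3.0, 1.5])

# Synthetic training data with 5 features.
X_train = rng.normal(size=(200, 5))
y_train = X_train @ true_w
model = LinearRegression().fit(X_train, y_train)

X_b = rng.normal(size=(2480, 5))  # same row count as testB
y_a = rng.normal(size=1315)       # same row count as testA's target

# 2480 predictions cannot be compared against 1315 targets.
try:
    model.score(X_b, y_a)
    sample_mismatch_raised = False
except ValueError as err:
    sample_mismatch_raised = True
    print("ValueError:", err)

# Scoring works when features and target describe the same rows.
y_b = X_b @ true_w
r2 = model.score(X_b, y_b)
print(r2)
```

Note that a score is only meaningful when row i of the target really belongs to row i of the features; matching the lengths alone does not guarantee that.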

EDIT

Assuming you have datasets with an equal number of features as discussed above, you can now reshape testB (by removing rows) like so:

testB = testB[0:1315, :]  # keep the first 1315 rows (the end index is exclusive)
testB.shape
(1315, 5)

Or, if you would prefer a solution using the numpy API:

testB = np.delete(testB, np.s_[0:(len(testB)-len(testA))], axis=0)
testB.shape
(1315, 5)
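Both snippets above assume testB is a NumPy array. If it is a pandas DataFrame instead (an assumption, since the question does not say), positional slicing goes through .iloc. A rough sketch with a made-up DataFrame of the same shape:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for testB: 2480 rows, 5 columns.
testB = pd.DataFrame(
    np.arange(2480 * 5).reshape(2480, 5),
    columns=["a", "b", "c", "d", "e"],
)
n_rows_a = 1315  # number of rows in testA

# Keep the first 1315 rows (mirrors the slicing approach).
testB_head = testB.iloc[:n_rows_a]

# Keep the last 1315 rows (mirrors the np.delete approach, which drops the first 1165).
testB_tail = testB.iloc[len(testB) - n_rows_a:]

print(testB_head.shape, testB_tail.shape)
```

The two approaches keep different rows, so which one is appropriate depends on how the rows of testB relate to the rows of testA.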

Keep in mind that doing this discards a number of samples. If those samples matter to you (and they can), it may be better to introduce a pre-processing step that deals with the missing values, namely imputing them. It is also worth noting that the data you are truncating should be shuffled first (unless it already is), as you may otherwise remove parts of the data the model should be learning about. Neglecting to do this could result in a model that does not generalise as well as you hoped.
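As a rough illustration of those two pre-processing steps, the sketch below fills missing values with scikit-learn's SimpleImputer and shuffles features and target together with sklearn.utils.shuffle. The tiny arrays are made up, and mean imputation is just one possible strategy, not necessarily the right one for your data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.utils import shuffle

# Made-up feature matrix with two missing values, and a matching target.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan],
              [7.0, 8.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# Shuffle rows of X and y with the same permutation so they stay aligned.
X_shuffled, y_shuffled = shuffle(X_filled, y, random_state=0)

print(X_filled)
print(X_shuffled.shape, y_shuffled.shape)
```

Shuffling X and y in a single call matters: shuffling them separately would break the pairing between each sample and its target.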
