I am trying to do multivariate linear regression, but I find that sklearn.linear_model is behaving very strangely. Here's my code:
import numpy as np
from sklearn import linear_model

b = np.array([3, 5, 7]).transpose()  ## the right answer I am expecting
x = np.array([[1, 6, 9],   ## 1*3 + 6*5 + 7*9 = 96
              [2, 7, 7],   ## 2*3 + 7*5 + 7*7 = 90
              [3, 4, 5]])  ## 3*3 + 4*5 + 5*7 = 64
y = np.array([96, 90, 64]).transpose()

clf = linear_model.LinearRegression()
clf.fit(x, y)
print(clf.coef_)             ## <== it gives me [-2.2 5 4.4], NOT [3, 5, 7]
print(np.dot(x, clf.coef_))  ## <== it gives me [ 67.4 61.4 35.4]
In order to get your initial coefficients back, you need to pass the keyword fit_intercept=False when constructing the linear regression.
import numpy as np
from sklearn import linear_model

b = np.array([3, 5, 7])
x = np.array([[1, 6, 9], [2, 7, 7], [3, 4, 5]])
y = np.array([96, 90, 64])

clf = linear_model.LinearRegression(fit_intercept=False)
clf.fit(x, y)
print(clf.coef_)             ## recovers b: [3. 5. 7.]
print(np.dot(x, clf.coef_))  ## reproduces y: [96. 90. 64.]
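As a sanity check (a sketch of my own, not part of the fix): this particular system is square with a nonsingular x, so the no-intercept least-squares fit coincides with the exact solution of x @ b = y, which you can compute directly:

import numpy as np

x = np.array([[1, 6, 9], [2, 7, 7], [3, 4, 5]])
y = np.array([96, 90, 64])

# x is invertible (det = -44), so the system has the unique exact solution
print(np.linalg.solve(x, y))  # [3. 5. 7.]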
Using fit_intercept=False prevents the LinearRegression object from working with x - x.mean(axis=0), which it would otherwise do, capturing the mean with a constant offset y = xb + c (or, equivalently, by adding a column of 1s to x).
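To make that equivalence concrete, here is a minimal sketch. The random data is my own, chosen so the least-squares solution is unique; with only the 3 samples above, the problem would be underdetermined once an intercept is added. Appending a column of ones and fitting with fit_intercept=False reproduces both coef_ and intercept_ of the default fit:

import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))
y = x @ np.array([3, 5, 7]) + 2.0  # known slopes [3, 5, 7] and offset 2

# Default: sklearn centers the data and fits the intercept separately
clf = linear_model.LinearRegression()
clf.fit(x, y)

# Equivalent: append a column of ones and fit without an intercept
x1 = np.hstack([x, np.ones((x.shape[0], 1))])
clf1 = linear_model.LinearRegression(fit_intercept=False)
clf1.fit(x1, y)

print(clf.coef_, clf.intercept_)  # approx [3. 5. 7.] and 2.0
print(clf1.coef_)                 # approx [3. 5. 7. 2.] -- the offset is the last coefficient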
As a side remark, calling transpose on a 1D array doesn't have any effect (it reverses the order of your axes, and you only have one).
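A quick illustration, with reshape shown for contrast in case a column vector was actually intended:

import numpy as np

b = np.array([3, 5, 7])
print(b.shape)                 # (3,)
print(b.transpose().shape)     # (3,) -- no effect: there is only one axis to reverse
print(b.reshape(-1, 1).shape)  # (3, 1) -- reshape does produce a column vector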