I am trying to fit a multivariable linear regression on a dataset to find out how well the model explains the data. My predictors have 120 dimensions and I have 177 samples:
X.shape=(177,120), y.shape=(177,)
Using statsmodels, I get a very good R-squared of 0.76 with a Prob(F-statistic) of 0.06, which trends toward significance and suggests the model fits the data well.
When I use scikit-learn's LinearRegression and compute the 5-fold cross-validated r2 score, I get an average r2 of -5.06, which indicates very poor generalization performance.
The two models should be exactly the same, since their train r2 scores match. So why are the performance evaluations from these two libraries so different? Which one should I trust? I'd greatly appreciate your comments on this.
Here is my code for your reference:
# using statsmodels:
import statsmodels.api as sm

X = sm.add_constant(X)
est = sm.OLS(y, X)
est2 = est.fit()
print(est2.summary())

# using scikit-learn:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score  # import was missing

lin_reg = LinearRegression()
lin_reg.fit(X, y)
print('train r2 score:', lin_reg.score(X, y))  # Python 3 print
cv_results = cross_val_score(lin_reg, X, y, cv=5, scoring='r2')
print('r2 score: %f (%f)' % (cv_results.mean(), cv_results.std()))
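For reference, the same behavior can be reproduced on purely synthetic data of the same shape (this is a minimal sketch, not my real dataset; the variable names are placeholders). With 120 predictors and only 177 samples, even pure-noise features give a high in-sample r2, while cross-validation exposes the overfitting:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((177, 120))  # pure-noise predictors
y = rng.standard_normal(177)         # target unrelated to X

lin_reg = LinearRegression()
lin_reg.fit(X, y)
train_r2 = lin_reg.score(X, y)  # inflated in-sample r2 (roughly p/(n-1))
cv_r2 = cross_val_score(lin_reg, X, y, cv=5, scoring='r2').mean()  # out-of-sample
print('train r2: %.2f, cv r2: %.2f' % (train_r2, cv_r2))
```

On this noise data the train r2 comes out large and positive while the cross-validated r2 is negative, just like in my results.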