sklearn pipeline transform ValueError that Expected Value is not equal to Trained Value

2024/10/8 6:19:20

Can you please help me with the following function? It raises ValueError: Column ordering must be equal for fit and for transform when using the remainder keyword

(The function is called on a pickled sklearn pipeline that I had saved in GCP Storage.)


ValueError                                Traceback (most recent call last)
<ipython-input-192-c6a8bc0ab221> in <module>
----> 1 safety_project_lite(request)

<ipython-input-190-24c565131f14> in safety_project_lite(request)
     31 
     32     df_resp = pd.DataFrame(data=request_data)
---> 33     response = loaded_model.predict(df_resp)
     34 
     35     output = {"Safety Rating": response[0]}

~/.local/lib/python3.5/site-packages/sklearn/utils/ in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

~/.local/lib/python3.5/site-packages/sklearn/ in predict(self, X, **predict_params)
    417         Xt = X
    418         for _, name, transform in self._iter(with_final=False):
--> 419             Xt = transform.transform(Xt)
    420         return self.steps[-1][-1].predict(Xt, **predict_params)
    421 

~/.local/lib/python3.5/site-packages/sklearn/compose/ in transform(self, X)
    581             if (n_cols_transform >= n_cols_fit and
    582                     any(X.columns[:n_cols_fit] != self._df_columns)):
--> 583                 raise ValueError('Column ordering must be equal for fit '
    584                                  'and for transform when using the '
    585                                  'remainder keyword')

ValueError: Column ordering must be equal for fit and for transform when using the remainder keyword


import pickle
from io import BytesIO

import pandas as pd
from google.cloud import storage


def safety_project_lite_beta(request):
    # Download the pickled pipeline from GCP Storage into memory.
    client = storage.Client(request.GCP_Project)
    bucket = client.get_bucket(request.GCP_Bucket)
    blob = bucket.blob(request.GCP_Path)
    model_file = BytesIO()
    blob.download_to_file(model_file)
    loaded_model = pickle.loads(model_file.getvalue())

    # Build a one-row DataFrame from the request fields.
    request_data = {
        'A': [request.A],
        'B': [request.B],
        'C': [request.C],
        'D': [request.D],
        'E': [request.E],
        'F': [request.F],
    }
    df_resp = pd.DataFrame(data=request_data)

    response = loaded_model.predict(df_resp)
    output = {"Rating": response[0]}
    return output

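The model can only predict if the data you feed it has the same structure as the data it was trained on. As the error message says, a pipeline that uses the remainder keyword also needs the columns in the same order as during fit.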

To make sure df_resp has the same columns as X_train, in the same order, pass the training columns along when building the DataFrame:

df_resp = pd.DataFrame(request_data, columns=X_train.columns)
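
To illustrate what the columns argument does, here is a minimal sketch. The small X_train frame and the request values are made-up stand-ins (only the column order of X_train matters), so treat it as an illustration under those assumptions rather than the asker's actual data:

```python
import pandas as pd

# Hypothetical stand-in for the training frame; only its column order matters.
X_train = pd.DataFrame({"A": [1], "B": [2], "C": [3]})

# Hypothetical request payload whose keys arrive in a different order.
request_data = {"C": [30], "A": [10], "B": [20]}

# Passing columns= selects the dict keys in the given order.
df_resp = pd.DataFrame(request_data, columns=X_train.columns)
print(list(df_resp.columns))

# For a frame that already exists, reindex gives the same reordering.
df_existing = pd.DataFrame(request_data).reindex(columns=X_train.columns)
print(list(df_existing.columns))
```

One thing to keep in mind: a column that is listed in columns but absent from request_data is not an error in pandas; it is created and filled with NaN, so a check for missing keys before predicting may be worthwhile.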

If X_train is not available at prediction time, you could pickle its column list (X_train.columns) at training time and load it later:

loaded_cols = pickle.loads([...])
df_resp = pd.DataFrame(data=request_data, columns=loaded_cols)

This also keeps the workflow more flexible: if you add columns later, you only need to save the new column list, and the prediction code stays the same.
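
The save-and-load step could look roughly like the sketch below. It uses an in-memory pickle round trip instead of a real file or bucket, and the train_columns list and the order_request helper are hypothetical names introduced only for this illustration:

```python
import pickle

# Hypothetical stand-in for list(X_train.columns) at training time.
train_columns = ["A", "B", "C", "D", "E", "F"]

# At training time: serialize the column list (in practice, written to storage).
serialized = pickle.dumps(train_columns)

# At prediction time: load the column list back.
loaded_cols = pickle.loads(serialized)


def order_request(request_data, columns):
    """Return request_data with its keys in the training column order."""
    missing = [col for col in columns if col not in request_data]
    if missing:
        # Fail early instead of letting a missing column reach the model.
        raise KeyError(f"request is missing columns: {missing}")
    return {col: request_data[col] for col in columns}


# Hypothetical request payload with keys in a different order.
request_data = {"F": [6], "E": [5], "D": [4], "C": [3], "B": [2], "A": [1]}
ordered = order_request(request_data, loaded_cols)
print(list(ordered))
```

The ordered dict can then be passed to pd.DataFrame together with columns=loaded_cols as shown above. Checking for missing keys first is a design choice of this sketch, not something the pipeline requires.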

