Using multiple custom classes with Pipeline sklearn (Python)

2024/9/8 10:33:50

I am trying to write a tutorial on Pipeline for students, but I am stuck. I'm not an expert, but I'm trying to improve, so thank you for your patience. I am trying to run several steps in a pipeline to prepare a dataframe for a classifier:

  • Step 1: Description of the dataframe
  • Step 2: Fill NaN Values
  • Step 3: Transforming Categorical Values into Numbers

Here is my code:

from sklearn.preprocessing import LabelEncoder

class Descr_df(object):

    def transform(self, X):
        print("Structure of the data: \n {}".format(X.head(5)))
        print("Features names: \n {}".format(X.columns))
        print("Target: \n {}".format(X.columns[0]))
        print("Shape of the data: \n {}".format(X.shape))

    def fit(self, X, y=None):
        return self

class Fillna(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        for column in X.columns:
            if column in non_numerics_columns:
                # Non-numeric column: fill with the most frequent value
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
            else:
                # Numeric column: fill with the mean
                X[column] = X[column].fillna(X[column].mean())
        return X

    def fit(self, X, y=None):
        return self

class Categorical_to_numerical(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            X[column] = X[column].fillna(X[column].value_counts().idxmax())
            le.fit(X[column])
            X[column] = le.transform(X[column]).astype(int)
        return X

    def fit(self, X, y=None):
        return self

If I execute steps 1 and 2, or steps 1 and 3, it works, but if I execute steps 1, 2 and 3 together I get this error:

pipeline = Pipeline([('df_intropesction', Descr_df()), ('fillna',Fillna()), ('Categorical_to_numerical', Categorical_to_numerical())])
pipeline.fit(X, y)
AttributeError: 'NoneType' object has no attribute 'columns'
Answer

This error arises because, in a Pipeline, the output of the first estimator is passed to the second, the output of the second is passed to the third, and so on.

From the documentation of Pipeline:

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

So for your pipeline, the steps of execution are as follows:

  1. Descr_df.fit(X) -> does nothing and returns self.
  2. newX = Descr_df.transform(X) -> should return a value to assign to newX, which is then passed on to the next estimator. Your definition only prints and returns nothing, so None is returned implicitly.
  3. Fillna.fit(newX) -> does nothing and returns self.
  4. Fillna.transform(newX) -> calls newX.columns, but newX is None from step 2. Hence the error.
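
The chain of calls above can be reproduced without scikit-learn. The following is only a minimal sketch: run_steps is a simplified imitation of what a pipeline does, and PrintOnly, NeedsColumns and Table are hypothetical stand-ins for Descr_df, Fillna and a dataframe, not names from the question or from any library.

```python
class Table:
    """Hypothetical stand-in for a dataframe: it only has columns and rows."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows


class PrintOnly:
    """Stand-in for Descr_df: transform prints but has no return statement."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Features names:", X.columns)  # falls off the end, so None is returned


class NeedsColumns:
    """Stand-in for Fillna: transform reads X.columns."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.columns


def run_steps(steps, X):
    """Simplified imitation of a pipeline: each output feeds the next step."""
    for step in steps:
        X = step.fit(X).transform(X)
    return X


error_message = None
try:
    run_steps([PrintOnly(), NeedsColumns()], Table(["a", "b"], [(1, 2)]))
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

The second step receives None because the first transform returned nothing, which is the same failure as in the question.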

Solution: Change the transform method of Descr_df to return the dataframe as it is:

def transform(self, X):
    print("Structure of the data: \n {}".format(X.head(5)))
    print("Features names: \n {}".format(X.columns))
    print("Target: \n {}".format(X.columns[0]))
    print("Shape of the data: \n {}".format(X.shape))
    return X

Suggestion: Make your classes inherit from the BaseEstimator and TransformerMixin classes in scikit-learn to conform to good practice.

That is, change class Descr_df(object) to class Descr_df(BaseEstimator, TransformerMixin), Fillna(object) to Fillna(BaseEstimator, TransformerMixin), and so on.
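
As a rough illustration of that suggestion, here is a sketch of two of the steps rewritten on top of the scikit-learn base classes. The class names DescribeFrame and FillMissing and the sample data are made up for this example. Two choices here are assumptions of the sketch rather than part of the original code: the mixin is listed before BaseEstimator, which is the order the scikit-learn developer guide recommends, and transform works on a copy so the caller's dataframe is left untouched.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DescribeFrame(TransformerMixin, BaseEstimator):
    """Prints a short description and passes the dataframe through unchanged."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: {}".format(X.shape))
        return X  # returning X is what lets the next step receive the data


class FillMissing(TransformerMixin, BaseEstimator):
    """Fills numeric columns with the mean, other columns with the mode."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # assumption of this sketch: do not mutate the input
        numeric_columns = X.select_dtypes(include="number").columns
        for column in X.columns:
            if column in numeric_columns:
                X[column] = X[column].fillna(X[column].mean())
            else:
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
        return X


# Made-up sample data with one missing value per column
frame = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["Paris", None, "Paris"]})

pipeline = Pipeline([("describe", DescribeFrame()), ("fillna", FillMissing())])
result = pipeline.fit_transform(frame)
print(result)
```

TransformerMixin supplies fit_transform, so each class only has to define fit and transform.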

See this example for more details on custom classes in Pipeline:

  • http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py