Sklearn Pipeline all the input array dimensions for the concatenation axis must match exactly

2024/9/20 21:27:09
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])

tweet_text_transformer = Pipeline(steps=[
    ('count_vectoriser', CountVectorizer()),
    ('tfidf', TfidfTransformer())
])

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

preprocessor = ColumnTransformer(transformers=[
    # (name, transformer, column(s))
    ('tweet', tweet_text_transformer, ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

I don't understand where this error is coming from:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 2

Answer

Reason

The issue is in the preprocessor step. A ColumnTransformer stacks the outputs of its transformers horizontally, here the output of tweet_text_transformer and the output of numeric_transformer. For that to succeed, both outputs must have the same number of rows (the same size along axis 0, i.e. dimension 0).
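The stacking constraint can be reproduced with NumPy alone. This is a minimal sketch: the array names and shapes are made up for illustration, chosen to mirror the sizes in the error message (1 row versus 2 rows).

```python
import numpy as np

# Hypothetical stand-ins for the two transformer outputs
text_features = np.ones((1, 2))     # only 1 row
numeric_features = np.ones((2, 3))  # 2 rows, one per sample

try:
    np.hstack([text_features, numeric_features])
    stacked_ok = True
except ValueError as exc:
    # Row counts differ, so horizontal stacking is refused
    stacked_ok = False
    print(exc)

# With matching row counts the same call succeeds
good_shape = np.hstack([np.ones((2, 2)), numeric_features]).shape
print(good_shape)
```

The first call fails because the two arrays disagree along dimension 0; the second works because both have two rows.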

When the pipeline above runs, tweet_text_transformer does not return the two rows we expect. Because the column is selected with a list (['Text field']), ColumnTransformer passes a one-column DataFrame to CountVectorizer. CountVectorizer iterates over its input to get the documents, and iterating over a DataFrame yields its column labels rather than its rows, so it sees a single document, 'Text field', and returns a matrix with one row. numeric_transformer returns two rows, so the two outputs cannot be stacked, which is what the error reports (size 1 at index 0, size 2 at index 1). The sparse output of CountVectorizer is not the cause: a sparse matrix stores only the non-zero entries, but its shape still has one row per document.
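The row counts can be checked directly. The following is a small sketch, assuming a toy DataFrame with the same two texts as in the question; the variable names are illustrative. It feeds CountVectorizer first a one-column DataFrame and then a Series.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'Text field': ['text', 'text more']})

# A one-column DataFrame: iterating over it yields the column label,
# so the vectorizer sees the single "document" 'Text field'
as_frame = CountVectorizer().fit_transform(df[['Text field']])

# A Series: iterating over it yields the two strings, one per row
as_series = CountVectorizer().fit_transform(df['Text field'])

rows_from_frame = as_frame.shape[0]
rows_from_series = as_series.shape[0]
print(rows_from_frame, rows_from_series)
```

Only the Series version produces one row per sample, which is what ColumnTransformer needs in order to stack it next to the numeric columns.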

Solution

  • Create a custom transformer that wraps the text pipeline and converts its sparse output to a NumPy array (the conversion is optional; it is not what fixes the error)
  • Since there is only one text column, squeeze the pandas DataFrame into a pandas Series inside that transformer, so that CountVectorizer receives one string per row. This is the step that fixes the error

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])

class TweetTextProcessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.tweet_text_transformer = Pipeline(steps=[
            ('count_vectoriser', CountVectorizer()),
            ('tfidf', TfidfTransformer())
        ])

    def fit(self, X, y=None):
        # squeeze(axis=1) turns the one-column DataFrame into a Series of strings
        self.tweet_text_transformer.fit(X.squeeze(axis=1))
        return self

    def transform(self, X, y=None):
        # Only transform here; calling fit_transform would refit the
        # vocabulary on whatever data is passed at prediction time
        return self.tweet_text_transformer.transform(X.squeeze(axis=1)).toarray()

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('tweet', TweetTextProcessor(), ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

The above code should work. Let me know if it does not, or if the explanation was unclear.
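A variant without the custom class may also be worth trying. This is a sketch, not the method used in the code above. ColumnTransformer accepts a scalar column name as well as a list, and with a plain string ('Text field' instead of ['Text field']) it passes a 1-D Series to the transformer, which is the input CountVectorizer expects. The name text_pipeline is illustrative; the data is the same as in the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

data = pd.DataFrame(
    [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']],
    columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'],
)

text_pipeline = Pipeline(steps=[
    ('count_vectoriser', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])

preprocessor = ColumnTransformer(transformers=[
    # A string selector hands the transformer a 1-D Series
    ('tweet', text_pipeline, 'Text field'),
    # A list selector hands the transformer a 2-D DataFrame
    ('numeric', MinMaxScaler(), ['Num1', 'Num2', 'Num3']),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC()),
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

# Inspect the shape of the stacked feature matrix
n_rows, n_cols = pipeline.named_steps['preprocessor'].transform(X_train).shape
print(n_rows, n_cols)
```

Here the text features and the scaled numeric features both have one row per sample, so they can be stacked without any wrapper.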
