Question 1

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])

tweet_text_transformer = Pipeline(steps=[
    ('count_vectoriser', CountVectorizer()),
    ('tfidf', TfidfTransformer())
])

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

preprocessor = ColumnTransformer(transformers=[
    # (name, transformer, column(s))
    ('tweet', tweet_text_transformer, ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

I don't understand where this error is coming from:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 2

Answer

Reason

The issue is in the preprocessor. A ColumnTransformer stacks the output of tweet_text_transformer and the output of numeric_transformer horizontally. For this to succeed, both outputs must have the same number of rows (i.e. the same size along axis 0, or dimension 0).

But when the pipeline from the question is executed, tweet_text_transformer does not return the 2-row matrix we expect. Because the column is selected with a list (['Text field']), the ColumnTransformer passes a one-column DataFrame to CountVectorizer. CountVectorizer expects a one-dimensional iterable of strings, and iterating over a DataFrame yields its column names, so it sees a single document ("Text field") and returns a matrix with only 1 row. The output of numeric_transformer has 2 rows, so the two cannot be stacked. This is exactly what the error reports: size 1 for the array at index 0 and size 2 for the array at index 1. The sparse output format is not the cause: a sparse matrix stores only its non-zero entries (3 of the 4 here) but keeps its full shape.
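The row counts can be checked directly. The following is a minimal sketch, assuming only the 'Text field' column from the question; the names as_frame, as_series and numeric are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'Text field': ['text', 'text more']})

# One-column DataFrame: iterating over it yields the column name,
# so CountVectorizer sees a single document, "Text field".
as_frame = CountVectorizer().fit_transform(df[['Text field']])

# Series: iterating over it yields one string per row.
as_series = CountVectorizer().fit_transform(df['Text field'])

print(as_frame.shape)   # (1, 2)
print(as_series.shape)  # (2, 2)
print(as_series.nnz)    # 3 stored values; the shape is still (2, 2)

# Stacking a 1-row block next to a 2-row block raises a ValueError
numeric = np.array([[1, 3, 4], [9, 3, 6]])
try:
    np.hstack([as_frame.toarray(), numeric])
except ValueError as err:
    print(err)
```

Only the DataFrame case loses a row; the sparse result built from the Series still has one row per sample.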

[Image: output of the explanation]

Solution

Create a custom transformer that squeezes the one-column pandas DataFrame into a pandas Series, so that CountVectorizer receives one string per row, and that converts the resulting sparse matrix to a NumPy array.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

data = [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']]
data = pd.DataFrame(data, columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'])

class TweetTextProcessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.tweet_text_transformer = Pipeline(steps=[
            ('count_vectoriser', CountVectorizer()),
            ('tfidf', TfidfTransformer())
        ])

    def fit(self, X, y=None):
        # Learn the vocabulary and IDF weights from the training text only
        self.tweet_text_transformer.fit(X.squeeze(axis=1))
        return self

    def transform(self, X, y=None):
        # Squeeze the one-column DataFrame into a Series so that
        # CountVectorizer sees one document per row, then densify
        return self.tweet_text_transformer.transform(X.squeeze(axis=1)).toarray()

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('tweet', TweetTextProcessor(), ['Text field']),
    ('numeric', numeric_transformer, ['Num1', 'Num2', 'Num3'])
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC())
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

The above code should work. Let me know if it doesn't, or if the explanation was not clear (hopefully it is).
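A possible alternative, not part of the answer above: ColumnTransformer hands the transformer a one-dimensional Series when the column is given as a plain string rather than a list, so the text column can be selected as 'Text field' without a custom class. The sketch below assumes the same toy data; the features variable is added only to inspect the result.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

data = pd.DataFrame(
    [[1, 3, 4, 'text', 'pos'], [9, 3, 6, 'text more', 'neg']],
    columns=['Num1', 'Num2', 'Num3', 'Text field', 'Class'],
)

tweet_text_transformer = Pipeline(steps=[
    ('count_vectoriser', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
])

preprocessor = ColumnTransformer(transformers=[
    # A plain string selects the column as a 1-D Series
    ('tweet', tweet_text_transformer, 'Text field'),
    # A list selects a 2-D DataFrame, which the scaler needs
    ('numeric', MinMaxScaler(), ['Num1', 'Num2', 'Num3']),
])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LinearSVC()),
])

X_train = data.loc[:, 'Num1':'Text field']
y_train = data['Class']
pipeline.fit(X_train, y_train)

# Two TF-IDF columns plus three scaled numeric columns, one row per sample
features = pipeline.named_steps['preprocessor'].transform(X_train)
print(features.shape)  # (2, 5)
```

This keeps the text features sparse where scikit-learn decides that is worthwhile, and the vocabulary is learned only during fit.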
