I'm trying to use CountVectorizer() with Pipeline and ColumnTransformer. Because CountVectorizer() produces a sparse matrix, I used a FunctionTransformer to ensure the ColumnTransformer can hstack correctly when putting together the resulting matrix.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from typing import Callable
# Dataset
df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                   ['b', 'How you been Tom', 'hot coffee', 2],
                   ['c', 'Hi you', 'I want some coffee', 3]],
                  columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])
# Use FunctionTransformer to ensure dense matrix
def tf_text(X, vectorizer_tf: Callable):
    X_vect_ = vectorizer_tf.fit_transform(X)
    return X_vect_.toarray()
tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})
# Transformation Pipelines
tf_transformer_pipe = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
           ('tf', tf_transformer)])
ohe_transformer_pipe = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
           ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])
transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
], remainder='passthrough')
transformed_df = transformer.fit_transform(df)
I get AttributeError: 'numpy.ndarray' object has no attribute 'lower'. I've seen this question and suspect CountVectorizer() is the culprit, but I'm not sure how to solve it (the previous question doesn't use ColumnTransformer). I stumbled upon a DenseTransformer that I wish I could use instead of FunctionTransformer, but unfortunately it is not supported at my company.
Imo, the first consideration to make is that CountVectorizer() requires 1D input; your example is not working because the imputation returns a 2D numpy array, which means you'll need some customized handling to make it work.
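For instance, here is a minimal standalone sketch of mine (not taken from your code) that shows the shape mismatch:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer
texts = np.array([['Hi Tom'], ['How you been Tom'], ['Hi you']], dtype=object)
imputed = SimpleImputer(strategy='constant', fill_value='missing').fit_transform(texts)
print(imputed.shape)                              # (3, 1) -> the imputer returns a 2D array
CountVectorizer().fit_transform(imputed.ravel())  # fine: a 1D iterable of strings
# CountVectorizer().fit_transform(imputed)        # AttributeError: 'numpy.ndarray' object has no attribute 'lower'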
Then you should also consider that, when using a CountVectorizer() instance (which, again, requires 1D input) as a transformer in a ColumnTransformer(), this is how the transformer's columns should be passed:
columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. [...]
This would be useful in interpreting the snippet I'll post as a possible solution.
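Just to make that scalar-versus-list distinction concrete, here is a quick sketch of mine on your df (illustrative only, not part of the solution snippet that follows):
ct_scalar = ColumnTransformer([('tf', CountVectorizer(), 'col_for_countvectorizer_1')])
ct_scalar.fit_transform(df)    # works: a scalar column name passes a 1D Series to CountVectorizer
ct_list = ColumnTransformer([('tf', CountVectorizer(), ['col_for_countvectorizer_1'])])
# ct_list.fit_transform(df)    # fails: a one-element list passes a 2D DataFrame to CountVectorizer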
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from typing import Callable
from sklearn.base import BaseEstimator, TransformerMixin
# Dataset
df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                   ['b', 'How you been Tom', 'hot coffee', 2],
                   ['c', 'Hi you', 'I want some coffee', 3]],
                  columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])
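# Custom transformer: simply wraps its input array into a pandas DataFrame,
# so that the following steps can index its columns positionally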
class DimTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, *_):
        return self
    def transform(self, X, *_):
        return pd.DataFrame(X)
# Use FunctionTransformer to ensure dense matrix
def tf_text(X, vectorizer_tf: Callable):
    X_vect_ = vectorizer_tf.fit_transform(X)
    return X_vect_.toarray()
tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})
# Transformation Pipelines
tf_transformer_pipe = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
           ('dt', DimTransformer()),
           ('ct', ColumnTransformer([
               ('tf1', tf_transformer, 0),
               ('tf2', tf_transformer, 1)
           ]))])
ohe_transformer_pipe = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
           ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])
transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
], remainder='passthrough')
transformed_df = transformer.fit_transform(df)
Namely, I'm adding a transformer that simply turns the array returned by the SimpleImputer instance into a DataFrame. Then, and most importantly, since it seems not possible to apply the vectorization to the 2D input that comes out of the previous two steps ('imputer' and 'dt'), I'm adding a further ColumnTransformer which splits the vectorization into two parallel steps (one vectorization per column). Notice that at this point the columns are referenced positionally, as the column names may have changed. Of course, that's a custom solution, but at least it may provide some hints.
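As a quick check of that positional referencing (a sketch of mine, reusing the objects defined above): the DataFrame coming out of DimTransformer after imputation ends up with integer column labels, which is why 0 and 1 are used.
imputed = SimpleImputer(strategy='constant', fill_value='missing').fit_transform(
    df[['col_for_countvectorizer_1', 'col_for_countvectorizer_2']])
print(DimTransformer().fit_transform(imputed).columns)  # RangeIndex(start=0, stop=2, step=1)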
Given that you don't actually have missing values, you can see that this works by comparing it with the output from:
dt = DimTransformer().fit_transform(df)
ct = ColumnTransformer([
    ('tf1', tf_transformer, 1),
    ('tf2', tf_transformer, 2)
])
ct.fit_transform(dt)
print(ct.named_transformers_['tf1'].kw_args['vectorizer_tf'].vocabulary_)
print(ct.named_transformers_['tf2'].kw_args['vectorizer_tf'].vocabulary_)
and noticing that the columns from the fourth to the last but one of the previous output (namely, those produced by the 'cat_tf' step) coincide with the output of this last snippet.
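If you prefer an explicit check over eyeballing the vocabularies, something along these lines should also hold (a sketch of mine, relying on the fact that all the outputs above are dense and that the first three columns come from the one-hot encoding):
import numpy as np
print(np.array_equal(transformed_df[:, 3:-1], ct.fit_transform(dt)))  # expected: True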
Here are a couple of posts focused on the usage of CountVectorizer in a ColumnTransformer instance, though they did not consider imputing the dataset beforehand.