There is a similar question asked here on SO many years back, but it was never answered, and I have the same question. I would like to add new columns of data, in my case 3 dummy-variable columns, to a sparse matrix (from `TfidfVectorizer`) before building a `Pipeline` and running a `GridSearchCV` to find the best hyperparameters. Currently, I can do this model by model, without `GridSearchCV` or `Pipeline`, using the code below.
# this is an NLP project
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X = df["text"]    # column of text
y = df["target"]  # continuous target variable
X_train, X_unseen, y_train, y_unseen = train_test_split(
    X, y, test_size=0.5, stratify=df["platform"], random_state=42)
# vectorize
tvec = TfidfVectorizer(stop_words="english")
X_train_tvec = tvec.fit_transform(X_train)
# get dummies (select the training rows so they line up with X_train)
dummies = pd.get_dummies(df["dummies"]).loc[X_train.index].values
# add dummies to the tvec sparse matrix
X_train_tvec_dumm = hstack([X_train_tvec, dummies]).toarray()
From here, I can fit my model on the `X_train_tvec_dumm` training data, which contains the sparse matrix of word vectors from `TfidfVectorizer` (shape: `(n_rows, n_columns)`) plus the 3 dummy columns. The final shape is therefore `(n_rows, n_columns + 3)`.
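As a minimal sketch of that column-stacking step (toy matrices invented for illustration): `scipy.sparse.hstack` appends the dummy columns, and calling `.tocsr()` instead of `.toarray()` keeps the result sparse, which most scikit-learn estimators accept directly and which avoids blowing up memory on a large vocabulary.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# A toy TF-IDF-like sparse matrix: 4 rows, 5 term columns
tfidf = csr_matrix(np.eye(4, 5))

# 3 dummy columns for the same 4 rows
dummies = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [1, 0, 0]])

# hstack grows the column count; .tocsr() keeps it sparse
combined = hstack([tfidf, csr_matrix(dummies)]).tocsr()
print(combined.shape)  # (4, 8) -> (n_rows, n_columns + 3)
```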
I tried to build the `Pipeline` as follows.
# get dummies
dummies = pd.get_dummies(df["dummies"]).values

def add_dummies(matrix):
    return hstack([matrix, dummies]).toarray()
pipe = Pipeline([
    ("features", FeatureUnion([
        ("tvec", TfidfVectorizer(stop_words="english")),
        ("dummies", add_dummies(??))  # <-- how do I add this step into the pipeline?
    ])),
    ("ridge", RidgeCV())
])

pipe_params = {
    "features__tvec__max_features": [200, 500],
    "features__tvec__ngram_range": [(1,1), (1,2)]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=4)
gs.fit(X_train, y_train)
print(gs.best_score_)
There is a tutorial that describes how to build a custom transformer for a `Pipeline`, but its custom function engineers a new feature by transforming X_train. My dummy variables, unfortunately, come from outside the X_train set.
Instead of `get_dummies` from pandas, use `OneHotEncoder` from `sklearn.preprocessing` together with `ColumnTransformer` from `sklearn.compose`. Build a DataFrame that contains both the 'text' column and the categorical ('dummies') column as features.

Note that older versions of `OneHotEncoder` (before scikit-learn 0.20) only accepted integer features; on those versions, first encode/map your categories to ints, then apply `OneHotEncoder`. Recent versions encode string categories directly.
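A quick sketch of that last point, with an invented toy column of string categories:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Since scikit-learn 0.20, OneHotEncoder accepts string categories directly
enc = OneHotEncoder(handle_unknown="ignore")
platforms = np.array([["ios"], ["android"], ["web"], ["ios"]])

# Returns a sparse matrix with one column per distinct category
encoded = enc.fit_transform(platforms)
print(encoded.shape)  # (4, 3)
```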
# ...
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

text_features = 'text'
text_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))])

categorical_features = ['category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
        ('cat', categorical_transformer, categorical_features)])

pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ("ridge", RidgeCV())])
# ...
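With this layout, the original grid-search parameters just need longer names: each level of nesting adds a double-underscore prefix (outer pipeline step, then the ColumnTransformer entry, then the inner pipeline step, then the parameter). A self-contained sketch, with a toy DataFrame and target invented purely for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(transformers=[
    ("text", Pipeline([("vectorizer", TfidfVectorizer(stop_words="english"))]),
     "text"),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant",
                                                fill_value="missing")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["category"]),
])
pipe = Pipeline([("preprocessor", preprocessor), ("ridge", RidgeCV())])

# Parameter names chain through each nested step:
# <pipeline step>__<ColumnTransformer name>__<inner step>__<param>
pipe_params = {
    "preprocessor__text__vectorizer__max_features": [200, 500],
    "preprocessor__text__vectorizer__ngram_range": [(1, 1), (1, 2)],
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=2)

# Toy data, just to show the search runs end to end
df = pd.DataFrame({
    "text": ["good product", "bad service", "great support", "poor quality",
             "great product", "bad quality", "good service", "poor support"],
    "category": ["a", "b", "a", "b", "a", "b", "a", "b"],
})
y = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
gs.fit(df, y)
print(sorted(gs.best_params_))
```

Note that `gs.fit` now takes the whole DataFrame, not a single text column; the `ColumnTransformer` routes each column to the right transformer.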