So I currently have a Pipeline that has a lot of custom transformers:
from sklearn.pipeline import Pipeline

p = Pipeline([
    ("GetTimeFromDate", TimeTransformer("Date")),  # Custom transformer that adds a ["time"] column
    ("GetZipFromAddress", ZipTransformer("Address")),  # Custom transformer that adds a ["zip"] column
    ("GroupByTimeandZip", GroupByTransformer(["time", "zip"])),  # Custom transformer that adds one-hot columns
])
Each transformer takes in a pandas dataframe and returns the same dataframe with one or more new columns. It actually works quite well, but how can I run the "GetTimeFromDate" and the "GetZipFromAddress" steps in parallel?
I would like to use FeatureUnion:
from sklearn.pipeline import FeatureUnion, Pipeline

f = FeatureUnion([
    ("GetTimeFromDate", TimeTransformer("Date")),  # Custom transformer that adds a ["time"] column
    ("GetZipFromAddress", ZipTransformer("Address")),  # Custom transformer that adds a ["zip"] column
])
p = Pipeline([
    ("FeatureUnionStep", f),
    ("GroupByTimeandZip", GroupByTransformer(["time", "zip"])),  # Custom transformer that adds one-hot columns
])
The problem is that FeatureUnion returns a numpy.ndarray, whereas the "GroupByTimeandZip" step needs a DataFrame.
Is there a way I can get FeatureUnion to return a pandas dataframe?
For a FeatureUnion to output a DataFrame, you can use the PandasFeatureUnion from this blog post. Also see the gist.
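In case the links are unavailable, here is a minimal sketch of the same idea. It is not the code from the blog post: `DataFrameUnion` and the toy `AddColumn` transformer are hypothetical names made up for illustration, and the sketch assumes every transformer returns a DataFrame with the same index as its input. Because each of your transformers returns the original columns plus its new ones, the concatenated result contains the original columns several times, so the sketch keeps only the first copy of each column name.

```python
import pandas as pd
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameUnion(TransformerMixin, BaseEstimator):
    """Hypothetical FeatureUnion-like step that returns a DataFrame."""

    def __init__(self, transformer_list, n_jobs=None):
        self.transformer_list = transformer_list
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        # Fit every transformer on the same input frame.
        for _, transformer in self.transformer_list:
            transformer.fit(X, y)
        return self

    def transform(self, X):
        # Run the transforms (in parallel when n_jobs > 1) and collect DataFrames.
        frames = Parallel(n_jobs=self.n_jobs)(
            delayed(transformer.transform)(X)
            for _, transformer in self.transformer_list
        )
        out = pd.concat(frames, axis=1)
        # Each frame repeats the original columns; keep the first copy of each name.
        return out.loc[:, ~out.columns.duplicated()]


class AddColumn(TransformerMixin, BaseEstimator):
    """Toy stand-in for TimeTransformer / ZipTransformer (assumption, not the real ones)."""

    def __init__(self, source, target, func):
        self.source = source
        self.target = target
        self.func = func

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        out = X.copy()
        out[self.target] = out[self.source].map(self.func)
        return out


df = pd.DataFrame({
    "Date": ["2020-01-01 08:30", "2020-01-02 17:45"],
    "Address": ["1 Main St, 10001", "2 Oak Ave, 94105"],
})

union = DataFrameUnion([
    ("GetTimeFromDate", AddColumn("Date", "time", lambda s: s.split(" ")[1])),
    ("GetZipFromAddress", AddColumn("Address", "zip", lambda s: s.split(", ")[1])),
])
result = union.fit_transform(df)
print(result.columns.tolist())
```

You would then put this step in the Pipeline where "FeatureUnionStep" is, and the "GroupByTimeandZip" step receives a DataFrame containing both "time" and "zip". For cheap transformers the process start-up cost of `n_jobs > 1` can outweigh the gain, so it is worth timing both variants.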