I am trying to replicate in TensorFlow Transform some data preprocessing that I previously did in pandas.
I have a few CSV files, which I joined and aggregated with pandas to produce a training dataset. Now, as part of productionising the model, I would like this preprocessing to be done at scale with Apache Beam and TensorFlow Transform. However, it is not quite clear to me how I can reproduce the same data manipulation there. Let's look at two main operations: joining dataset `a` and dataset `b` to produce `c`, and grouping dataset `c` by `col1`. These would be quite straightforward operations in pandas, but how would I do them in TensorFlow Transform running on Apache Beam? Am I using the wrong tool for the job? If so, what would be the right tool?
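For reference, a minimal sketch of the pandas version I am trying to reproduce (the join key `key`, the file names, and the sum aggregation are just placeholders for illustration):

```python
import pandas as pd

# Load the two source datasets (file names are placeholders).
a = pd.read_csv("a.csv")
b = pd.read_csv("b.csv")

# JOIN: merge a and b on a shared key column to produce c.
c = a.merge(b, on="key")

# GROUP BY: aggregate c by col1 (sum is just an example aggregation).
result = c.groupby("col1").sum()
```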
You can use the Beam DataFrames API to do the join and other preprocessing exactly as you would in pandas. You can then use `to_pcollection` to get a `PCollection` that you can pass directly to your TensorFlow Transform operations, or write it out as a file to read in later.
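A minimal sketch of what that could look like for your join and group-by, assuming the same placeholder file names, join key, and aggregation as in the question (deferred Beam DataFrames support a large subset of the pandas API, so the method calls mirror pandas):

```python
import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline() as p:
    # Read the CSVs as deferred Beam DataFrames.
    a = p | "ReadA" >> read_csv("a.csv")
    b = p | "ReadB" >> read_csv("b.csv")

    # JOIN: same method call as in pandas.
    c = a.merge(b, on="key")

    # GROUP BY: aggregate per col1.
    grouped = c.groupby("col1").sum()

    # Convert back to a PCollection of rows to feed into TF Transform,
    # or alternatively write it out, e.g. with grouped.to_csv(...).
    pc = to_pcollection(grouped)
```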
For top-level functions (such as `merge`) one needs to do

```python
from apache_beam.dataframe.pandas_top_level_functions import pd_wrapper as beam_pd
```

and use operations `beam_pd.func(...)` in place of `pd.func(...)`.
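For example, a top-level merge would then look like this (the frames and column name carry over from the sketch above and are illustrative):

```python
from apache_beam.dataframe.pandas_top_level_functions import pd_wrapper as beam_pd

# a and b are deferred Beam DataFrames (e.g. from read_csv above);
# this replaces the top-level pandas call pd.merge(a, b, on="key").
c = beam_pd.merge(a, b, on="key")
```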