google-cloud-platform google-cloud-dataflow google-cloud-dataprep google-cloud-data-fusion

Can Google Data Fusion do the same data cleaning as Dataprep?


I want to train a machine learning model on some data. Before training the model I need to process the data, so I have been reading about ways to do it.

  1. First, create a Dataflow pipeline to upload the data to BigQuery or Google Cloud Storage, then create a data pipeline with Google Dataprep to clean it.

  2. The other way I read about is Data Fusion, which makes it easier to create data pipelines. My doubt is this: is Data Fusion only for building a pipeline, like Dataflow, so that I still have to use Dataprep to clean the data? Or can Data Fusion also clean the data and prepare it for my machine learning model?

If Data Fusion can clean the data the same way Dataprep does, when should I use Dataprep?


Solution

  • Data Fusion and Dataprep can do the same things. However, they execute differently.

    IMO, Data Fusion is designed more for data ingestion from one source to another, with few transformations. Dataprep is designed more for data preparation (as its name suggests): data cleaning, new column creation, column splitting. Dataprep also provides insights into the data to help you build your recipes.

    In addition, Beam is part of TensorFlow Extended, and your data engineering pipeline will be more consistent if you use a tool compatible with Beam (Dataprep runs its jobs on Dataflow, which is built on Beam).

    That's why I would recommend Dataprep instead of Data Fusion.
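
To make the preparation steps named in the answer more concrete (data cleaning, new column creation, column splitting), here is a minimal sketch in plain Python. It does not use the Dataprep or Data Fusion APIs; it only illustrates the kind of transformation either tool would apply to the data before training. The sample data, the column names, and the `prepare` helper are all hypothetical.

```python
import csv
import io

# Hypothetical raw input; column names and values are invented for illustration.
RAW = """full_name,age,city
 Ada Lovelace ,36,London
Alan Turing,,Manchester
Grace Hopper,85,
"""


def prepare(raw_text):
    """Trim whitespace, drop incomplete rows, split a column, derive a new column."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Cleaning: strip surrounding whitespace from every field.
        cleaned = {key: (value or "").strip() for key, value in row.items()}
        # Cleaning: skip any row that has an empty field.
        if not all(cleaned.values()):
            continue
        # Column splitting: full_name becomes first_name and last_name.
        first, _, last = cleaned.pop("full_name").partition(" ")
        cleaned["first_name"] = first
        cleaned["last_name"] = last
        # New column creation: numeric age plus a derived boolean flag.
        cleaned["age"] = int(cleaned["age"])
        cleaned["is_adult"] = cleaned["age"] >= 18
        rows.append(cleaned)
    return rows


if __name__ == "__main__":
    for prepared_row in prepare(RAW):
        print(prepared_row)
```

In Dataprep these steps would be recipe steps built in the UI, and in Data Fusion they would be transformations inside a pipeline; the logic is the same, only the place where it runs differs.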