Tags: python, google-cloud-storage, etl, google-cloud-dataflow, apache-beam-io

How to migrate files from on-prem to GCS?


I want to build an ETL pipeline that:

  1. Reads files from the on-prem filesystem
  2. Writes the files into a Cloud Storage bucket

Is it possible to import the files regularly (every day) directly with the Storage Transfer Service? Alternatively, suppose I want to build the pipeline with Dataflow, using Python as the programming language: is it possible to implement such a workflow? If yes, are there any Python examples with Apache Beam? A minimal sketch of what I have in mind is below.
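For reference, this is the kind of Beam pipeline I am imagining (untested); the source pattern and bucket path are placeholders, it rewrites text content line by line rather than copying files byte-for-byte, and it would have to run somewhere that can actually see the on-prem filesystem (e.g. with the DirectRunner):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: adjust the local source pattern and the destination bucket path.
SOURCE_PATTERN = "/data/exports/*.csv"                 # files on the on-prem filesystem
DESTINATION_PREFIX = "gs://my-bucket/imports/export"   # Cloud Storage output prefix


def run():
    # Runner, project, and credentials options would be passed via the command line or here.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadLocalFiles" >> beam.io.ReadFromText(SOURCE_PATTERN)
            | "WriteToGCS" >> beam.io.WriteToText(DESTINATION_PREFIX, file_name_suffix=".csv")
        )


if __name__ == "__main__":
    run()
```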

Thank you in advance


Solution

  • Since you stated that importing is a daily task, you may opt to use Cloud Composer instead of Dataflow, as discussed in this SO post. You can check the product details here. Cloud Composer runs Apache Airflow, and you can use the SFTPOperator and LocalFilesystemToGCSOperator to achieve your requirement (a rough sketch is shown below).

    If you opt to use Cloud Composer, you can post another question on SO for this specific product with the correct tagging so that others in the community can easily find the answer, and I will gladly share working code with the correct output with you.
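    To give you an idea of the shape of such a DAG, here is a rough, untested sketch; the connection ID, file paths, and bucket name are placeholders you would replace with your own:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.sftp.operators.sftp import SFTPOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

# Connection IDs, paths, and the bucket name below are placeholders.
with DAG(
    dag_id="onprem_to_gcs_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Pull the file from the on-prem host onto the worker's local disk via SFTP.
    fetch_file = SFTPOperator(
        task_id="fetch_from_onprem",
        ssh_conn_id="onprem_sftp",           # Airflow connection to the on-prem server
        remote_filepath="/data/export.csv",  # file on the on-prem host
        local_filepath="/tmp/export.csv",    # landing path on the worker
        operation="get",
    )

    # Upload the local copy to the Cloud Storage bucket.
    upload_to_gcs = LocalFilesystemToGCSOperator(
        task_id="upload_to_gcs",
        src="/tmp/export.csv",
        dst="imports/export.csv",
        bucket="my-destination-bucket",
    )

    fetch_file >> upload_to_gcs
```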