python, google-cloud-platform, google-cloud-run, dataflow

Running large pipelines on GCP


I want to scale a one-off pipeline I currently run locally up to the cloud.

  1. The script reads data from a large (30 TB), static S3 bucket made up of PDFs
  2. I pass these PDFs, via a ThreadPool, to a Docker container, which produces an output for each
  3. I save the output to a file (a rough sketch of these steps is shown below)
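
The local version looks roughly like the sketch below. This is a minimal sketch, assuming boto3 for S3 access and a Docker image invoked once per file; the bucket name, image name, worker count, and output path are illustrative placeholders rather than the actual code.

    import subprocess
    import tempfile
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    import boto3

    BUCKET = "my-pdf-bucket"        # hypothetical bucket name
    IMAGE = "pdf-processor:latest"  # hypothetical processing image
    OUTPUT_FILE = Path("results.jsonl")

    s3 = boto3.client("s3")

    def list_pdf_keys():
        """Iterate over every PDF key in the bucket."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET):
            for obj in page.get("Contents", []):
                if obj["Key"].endswith(".pdf"):
                    yield obj["Key"]

    def process_pdf(key):
        """Download one PDF and run the container on it, returning its stdout."""
        with tempfile.TemporaryDirectory() as tmp:
            local = Path(tmp) / Path(key).name
            s3.download_file(BUCKET, key, str(local))
            result = subprocess.run(
                ["docker", "run", "--rm", "-v", f"{tmp}:/data",
                 IMAGE, f"/data/{local.name}"],
                capture_output=True, text=True, check=True,
            )
            return result.stdout.strip()

    with ThreadPoolExecutor(max_workers=8) as pool, OUTPUT_FILE.open("w") as out:
        for line in pool.map(process_pdf, list_pdf_keys()):
            out.write(line + "\n")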

I can only test it locally on a small fraction of the dataset. The full pipeline would take a couple of days to run on a MacBook Pro.

I've been trying to replicate this on GCP, which I am still getting to know.

What is the best way to run such a Python data-processing pipeline with a container on GCP?


Solution

  • Thanks to the useful comments in the original post, I explored other alternatives on GCP.

    Using a VM on Compute Engine worked perfectly. The overhead is much less than I expected, and the setup went smoothly.
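
    For reference, creating such a worker VM can also be scripted with the google-cloud-compute client library. The sketch below is a minimal example under assumptions: the project, zone, machine type, and disk size are illustrative placeholders, and the same VM can just as easily be created from the Cloud Console or with gcloud.

        # Minimal sketch: create a worker VM with the google-cloud-compute client.
        # Project, zone, machine type, and disk size are illustrative assumptions.
        from google.cloud import compute_v1

        PROJECT = "my-project"   # hypothetical project ID
        ZONE = "us-central1-a"   # hypothetical zone
        NAME = "pdf-pipeline-vm"

        # Boot disk from a public Debian image, sized to stage batches of PDFs.
        disk = compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/debian-cloud/global/images/family/debian-12",
                disk_size_gb=200,
            ),
        )

        instance = compute_v1.Instance(
            name=NAME,
            machine_type=f"zones/{ZONE}/machineTypes/e2-standard-16",
            disks=[disk],
            network_interfaces=[
                compute_v1.NetworkInterface(network="global/networks/default"),
            ],
        )

        client = compute_v1.InstancesClient()
        operation = client.insert(project=PROJECT, zone=ZONE, instance_resource=instance)
        operation.result()  # block until the VM exists
        print(f"Created {NAME} in {ZONE}")

    Once the VM is up, SSH in (for example with gcloud compute ssh), install Docker and the script's dependencies, and let the pipeline run to completion there.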