
Process 350k requests per second and save data to Google Cloud Storage


I need to implement a microservice that is fairly simple in terms of logic and architecture but needs to handle around 350k requests per second.

All it has to do is ingest JSON data, validate it against simple rules, and write it to Google Cloud Storage as JSON files. There are lots of Google Cloud services and APIs available, but it's hard for me to pick the right stack and pipeline because I don't have much experience with them or with high-load systems.
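To make the "validate it according to simple rules" step concrete, here is a minimal sketch in Python (one of the two languages Dataflow is said to support below). The rules themselves are not given in the question, so the `id` and `value` fields and the `validate` helper are hypothetical placeholders.

```python
import json
from typing import Optional, Tuple

# Hypothetical rules, for illustration only: the real rules are not listed in the question.
REQUIRED_FIELDS = {"id": str, "value": (int, float)}


def validate(raw: bytes) -> Tuple[Optional[dict], Optional[str]]:
    """Return (record, None) if the payload is valid, else (None, reason)."""
    try:
        record = json.loads(raw)
    except (ValueError, UnicodeDecodeError) as exc:
        return None, f"invalid JSON: {exc}"
    if not isinstance(record, dict):
        return None, "top-level value must be a JSON object"
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            return None, f"missing field: {field}"
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None, f"wrong type for field: {field}"
    return record, None
```

Returning a reason instead of raising keeps the function cheap to call per message and makes it easy to route rejected payloads somewhere else later.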

There is an example I'm looking at https://cloud.google.com/pubsub/docs/pubsub-dataflow

The flow is the following:

PubSub > Dataflow > Cloud Storage
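As far as I understand it, a pipeline of this shape does not write one object per request; it groups messages into time windows and writes one file per window. A rough sketch of that grouping in plain Python, without any Google Cloud client calls, might look like the following. The window size, the `events/...` object naming, and both helper names are assumptions made for illustration.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # assumed window size; trades latency against number of files


def object_name(epoch_seconds: float, window: int = WINDOW_SECONDS) -> str:
    """Hypothetical object name for the window containing a non-negative timestamp."""
    start = int(epoch_seconds) // window * window
    stamp = datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y/%m/%d/%H%M%S")
    return f"events/{stamp}.ndjson"


def batch_by_window(
    events: Iterable[Tuple[float, dict]], window: int = WINDOW_SECONDS
) -> Dict[str, str]:
    """Group (timestamp, record) pairs into one newline-delimited JSON body per window."""
    grouped = defaultdict(list)
    for ts, record in events:
        grouped[object_name(ts, window)].append(json.dumps(record, sort_keys=True))
    return {name: "\n".join(lines) + "\n" for name, lines in grouped.items()}
```

Each key of the returned dictionary would become one Cloud Storage object, so the number of writes depends on the window size rather than on the request rate.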

It does exactly what I need (except data validation), but it looks like Dataflow is limited to Java and Python, and I'd rather use PHP.

Another relevant example is https://medium.com/google-cloud/cloud-run-using-pubsub-triggers-2db74fc4ac6d

It uses Cloud Run, which supports PHP, and PubSub to trigger the Cloud Run workload. So it goes like:

PubSub > Cloud Run 

and working with Cloud Storage in Run looks pretty simple.

Am I on the right track? Can something like the above work for me, or do I need something different?


Solution

  • My first intuition when I saw 350k requests per second and PubSub was this pattern:

    Pubsub > Dataflow > BigTable
    

    I would go with BigTable here because you can query a BigTable table from BigQuery for later analysis.

    Of course, it's expensive, but you get a very scalable system.

    An alternative, if your process fits the BigQuery streaming quotas, is to stream directly into BigQuery instead of BigTable.

    Pubsub > Dataflow > BigQuery
    

    The problem with a Cloud Run or App Engine solution is that you need to trigger a process externally (for example with Cloud Scheduler), and in that process run a loop that pulls messages from the PubSub subscription. You will run into several difficulties with this approach.

    EDIT

    I forgot that you didn't want to code in Java or Python. I can propose 2 alternatives if your process is really simple:

    Personal opinion: the programming language doesn't matter; use the right tool for the job. Using Cloud Run or App Engine for this will give you a much less stable and harder-to-maintain system than learning how to write 10 lines of Java code.
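To check the "fits the BigQuery streaming quotas" condition mentioned above, a back-of-the-envelope estimate is usually enough. The sketch below is only that: the 1,000-byte average payload is an assumption, and the limits passed to `fits` are placeholders that should be replaced with the current figures from the BigQuery quota documentation.

```python
from dataclasses import dataclass


@dataclass
class Load:
    requests_per_second: int
    avg_payload_bytes: int  # assumption: measure this on real traffic

    @property
    def bytes_per_second(self) -> int:
        return self.requests_per_second * self.avg_payload_bytes

    @property
    def requests_per_day(self) -> int:
        return self.requests_per_second * 86_400  # seconds in a day

    def fits(self, max_rows_per_second: int, max_bytes_per_second: int) -> bool:
        """True if both the row rate and the byte rate are within the given limits."""
        return (
            self.requests_per_second <= max_rows_per_second
            and self.bytes_per_second <= max_bytes_per_second
        )


load = Load(requests_per_second=350_000, avg_payload_bytes=1_000)
print(load.bytes_per_second)   # 350000000 bytes/s, i.e. 350 MB/s
print(load.requests_per_day)   # 30240000000 requests per day
```

With these assumed numbers the system ingests about 350 MB per second and roughly 30 billion records per day, which is the scale to compare against whichever sink is chosen.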