After following the dataflow tutorial, I used the pub/sub topic to big query template to parse a JSON record into a table. The Job has been streaming for 21 days. During that time I have ingested about 5000 JSON records, containing 4 fields (around 250 bytes).
After the bill came this month I started to look into resource usage. I have used 2,017.52 vCPU hr, memory 7,565.825 GB hr, Total HDD 620,407.918 GB hr.
This seems absurdly high for the tiny amount of data I have been ingesting. Is there a minimum amount of data I should have before using dataflow? It seems over powered for small cases. Is there another preferred method for ingesting data from a pub sub topic? Is there a different configuration when setting up a Dataflow Job that uses less resources?
It seems that the numbers you mentioned, correspond to not customizing the job resources. By default streaming jobs use a n1-standar-4 machine:
3 Streaming worker defaults: 4 vCPU, 15 GB memory, 400 GB Persistent Disk.
4 vCPU x 24 hrs x 21 days = 2,016
15 GB x 24 hrs x 21 days = 7,560
If you really need streaming in Dataflow, you will need to pay for resources allocated even if there is nothing to process.
Options:
Optimizing Dataflow
Your own way
Cloud functions (my recommendation)
"Cloud Functions provides a perpetual free tier for compute-time resources, which includes an allocation of both GB-seconds and GHz-seconds. In addition to the 2 million invocations, the free tier provides 400,000 GB-seconds, 200,000 GHz-seconds of compute time and 5GB of Internet egress traffic per month."[1]