google-cloud-platform, pyspark, google-bigquery, google-cloud-dataproc, billing

BigQuery scanning cost for Dataproc


I am implementing a data transformation for my business that involves fetching data from one massive table (~20 TB) and several smaller tables (<100 MB each) located in BigQuery. Depending on the use case, I fetch either the entire table or a single default date partition. A series of transformations follows: joins, filters, aggregations, and unions. The performance in Dataproc is quite impressive; it is comparable to, and in some cases faster than, BigQuery. I used 6 worker nodes of type n2-standard-16 with a 200 GB boot disk each.
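For illustration, the read side of my job looks roughly like the snippet below (project, dataset, table, the partition column, and the column names in the transformation are placeholders; the Spark BigQuery connector jar is assumed to be available on the cluster):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-transform").getOrCreate()

# Read a single date partition of the large table. The connector pushes the
# filter down, so only the matching partition is read via the Storage Read API.
big_df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.big_table")   # placeholder table
    .option("filter", "partition_date = '2024-01-01'")    # placeholder partition column
    .load()
)

# The smaller lookup tables are read in full.
dim_df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.small_dim")   # placeholder table
    .load()
)

# Joins, filters, aggregations and unions follow (illustrative column names).
result_df = (
    big_df.join(dim_df, "dim_key")
    .where("status = 'ACTIVE'")
    .groupBy("dim_key")
    .count()
)
```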

My question is: does fetching data from BigQuery into Dataproc using the Spark BigQuery connector lead to slot usage or any other cost implication on the BigQuery side? I am particularly concerned about the big table in my PySpark job.

I tried to find online documentation on BigQuery usage costs from Dataproc but could not find any. A reference URL or a detailed explanation from the 'BigQuery scanning cost from Dataproc' perspective would be highly appreciated.


Solution

  • If you are using the Spark BigQuery connector, the pricing considerations are described in this doc.

    My understanding is that

    1. For reading from BigQuery, users only pay for the BigQuery Storage Read API calls; see the BQ connector doc and the BigQuery Storage Read API pricing.

    2. For writing to BigQuery (a short sketch of both modes follows this list)

      2.1) in direct write mode (recommended), which is based on the BigQuery Storage Write API, users only pay for the API calls; see the BigQuery Storage Write API pricing.

      2.2) in indirect write mode (legacy), which stages temporary files in GCS, users only pay for the temporary GCS storage; the pricing is in this doc.
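
    To make the two write paths concrete, here is a minimal sketch (table and bucket names are placeholders, and the `writeMethod` option assumes a reasonably recent connector version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-write-demo").getOrCreate()

# Stand-in for the DataFrame produced by the Spark job.
result_df = spark.range(10).withColumnRenamed("id", "value")

# 2.1) Direct write: rows go through the BigQuery Storage Write API,
#      so you pay the Storage Write API rates for the ingested bytes.
(
    result_df.write.format("bigquery")
    .option("writeMethod", "direct")
    .mode("append")
    .save("my-project.my_dataset.output_table")      # placeholder table
)

# 2.2) Indirect write (legacy): data is first staged as files in a GCS bucket
#      and then loaded into BigQuery; only the temporary GCS storage is billed.
(
    result_df.write.format("bigquery")
    .option("writeMethod", "indirect")
    .option("temporaryGcsBucket", "my-temp-bucket")   # placeholder bucket
    .mode("append")
    .save("my-project.my_dataset.output_table")       # placeholder table
)
```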

    Dataproc doesn't charge extra in addition to the BigQuery or GCS cost mentioned above.

    BTW, another thing I noticed is that the 200 GB worker boot disk is too small. PD I/O throughput is proportional to disk size (see this doc), so small disks can lead to longer job duration and a higher total cost, because you pay for the CPUs for the extra runtime. The recommended size is at least 1 TB for standard PD, or use local SSDs (see this doc).