I have two GCS buckets with identical, sharded CSV files. Bucket federated-query-standard has a storage class of Standard; bucket federated-query-archive has a storage class of Archive.
Running identical queries using a federated/external source over the two buckets produces the exact same number of bytes billed/processed, 57.13 GB of data. Performance (query time) is roughly the same.
According to the official docs for BigQuery pricing:
"When querying an external data source from BigQuery, you are charged for the number of bytes read by the query. For more information, see Query pricing. You are also charged for storing the data on Cloud Storage. For more information, see Cloud Storage Pricing."
So, users are charged for two things: the data processed and the storage of the data in GCS. This makes complete sense.
My question: is there a hidden cost anywhere that I'm not seeing (or am unaware of) for querying GCS (e.g. retrieval costs), or a difference between storage classes?
Currently, there aren't any charges for reading from Archive or Coldline storage, hidden or otherwise. That doesn't mean this won't change in the future.
Because of the way BigQuery accesses GCS, GCS charges BigQuery for the access rather than you (i.e. it's an internal accounting matter).
Performance may be inconsistent if you use Archive storage. For that storage class there are fewer redundant copies, so tail latency will be higher.
For Coldline, however, you should see roughly equivalent performance to Standard storage. The reason is that under the covers, Coldline is implemented exactly the same way as Standard storage; the difference is that Coldline charges less for storage but makes it up on reads.
Since you aren't charged for BigQuery's reads against GCS, if you're doing a lot of federated querying over data in GCS but don't read the data much otherwise, your best bet is going to be Coldline.
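To make the trade-off concrete, here's a rough cost sketch. The prices and query volume below are illustrative assumptions only (on-demand query pricing and per-GB storage rates change; check the current GCP pricing pages), but the structure of the calculation follows the reasoning above: bytes scanned by BigQuery cost the same regardless of storage class, and no GCS retrieval fee is passed through to you, so only the storage rate differs.

```python
# Back-of-envelope monthly cost for federated BigQuery queries over GCS.
# All prices are hypothetical placeholders, not current GCP rates.

DATA_GB = 57.13            # size of the sharded CSVs (from the question)
QUERIES_PER_MONTH = 100    # assumed query volume

BQ_PRICE_PER_TB = 5.00     # assumed on-demand price per TB scanned
STORAGE_PRICE_PER_GB = {   # assumed monthly storage price per GB
    "standard": 0.020,
    "coldline": 0.004,
}

def monthly_cost(storage_class: str) -> float:
    storage = DATA_GB * STORAGE_PRICE_PER_GB[storage_class]
    # Bytes scanned are billed identically for every storage class,
    # and (per the answer) no retrieval fee reaches your bill.
    queries = QUERIES_PER_MONTH * (DATA_GB / 1024) * BQ_PRICE_PER_TB
    return storage + queries

for cls in STORAGE_PRICE_PER_GB:
    print(f"{cls}: ${monthly_cost(cls):.2f}/month")
```

Under these assumptions the query component dominates and is identical for both classes, so Coldline's lower storage rate makes it strictly cheaper for this access pattern.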
Again, this is a point-in-time response and this may change in the future.