I am working with two tables in BigQuery: tt, a massive external table backed by Bigtable and keyed by rowKey, and qt, a much smaller table containing the rowKey values I care about. My goal is to query tt to fetch only the rows where rowKey matches the values in qt.rowKey.
Here’s what I’ve tried so far:
Attempt 1: Using a subquery with IN
SELECT t.*
FROM tt t
WHERE t.rowKey IN (SELECT rowKey FROM qt)
Issue: This query runs indefinitely and does not finish.
Attempt 2: Using a LEFT JOIN
SELECT t.*
FROM qt
LEFT JOIN tt t ON t.rowKey = qt.rowKey
Issue: This query also takes an extremely long time and does not finish.
Attempt 3: Hardcoding a small subset of keys
SELECT t.*
FROM tt t
WHERE rowKey IN ("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
Result: This query runs successfully in a few seconds.
It seems like the IN clause or the JOIN is causing a full table scan on the external table (tt). When I hardcode a small set of rowKey values, the query performs as expected.
I want to efficiently query tt for the rowKey values present in qt.rowKey without causing a full scan of the massive tt table.
How can I optimize this query to avoid a full table scan on the tt external table? Are there specific techniques or best practices for querying Bigtable data in BigQuery with filters from a smaller table?
Posting this as an answer to my own question to help others who might run into the same issue.
After trying several approaches (IN subquery, JOIN, etc.), I realized the core problem is that BigQuery performs a full scan on external Bigtable tables when filtering using another table, even if the filter table is small. This made the query either extremely slow or impossible to complete.
To solve this, we ended up syncing the relevant Bigtable data into a native BigQuery table using the BigQuery Storage API. Here's what we did (a sketch of the worker follows this list):

1. Whenever a row is updated in Bigtable, the service doing the update publishes its rowKey to a Pub/Sub topic.
2. A subscriber picks up each rowKey, reads the current row from Bigtable, and writes it into a native BigQuery table via the Storage API.

This allowed us to keep our queries entirely within BigQuery and avoid using the external table. Once we made this change, our queries became fast and reliable.
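A minimal sketch of such a sync worker in Python might look like the following. All project, instance, subscription, and table names are hypothetical placeholders, and for brevity the sketch uses the BigQuery client's insert_rows_json streaming insert rather than the Storage Write API we actually used:

from google.cloud import bigquery, bigtable, pubsub_v1

# All identifiers below are hypothetical placeholders; adjust to your setup.
PROJECT = "my-project"
BT_INSTANCE = "my-bigtable-instance"
BT_TABLE = "tt"
BQ_TABLE = "my-project.my_dataset.tt_native"   # native BigQuery copy of tt
SUBSCRIPTION = "projects/my-project/subscriptions/tt-row-updates"

bt_table = bigtable.Client(project=PROJECT).instance(BT_INSTANCE).table(BT_TABLE)
bq_client = bigquery.Client(project=PROJECT)

def handle_message(message: "pubsub_v1.subscriber.message.Message") -> None:
    """Fetch the updated row from Bigtable and append it to the native table."""
    row_key = message.data.decode("utf-8")   # the updater publishes the bare rowKey
    row = bt_table.read_row(row_key)
    if row is not None:
        # Flatten Bigtable cells into a flat record, keeping the newest cell
        # per column (assumes string-encoded values; adjust to your schema).
        record = {"rowKey": row_key}
        for family, columns in row.cells.items():
            for qualifier, cells in columns.items():
                record[qualifier.decode("utf-8")] = cells[0].value.decode("utf-8")
        # This appends only; duplicates from repeated updates can be
        # deduplicated at query time, e.g. ROW_NUMBER() OVER (PARTITION BY rowKey).
        errors = bq_client.insert_rows_json(BQ_TABLE, [record])
        if errors:
            # Leaving the message unacked makes Pub/Sub redeliver it.
            raise RuntimeError(f"BigQuery insert failed: {errors}")
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
streaming_pull = subscriber.subscribe(SUBSCRIPTION, callback=handle_message)
streaming_pull.result()   # block and process row-update messages as they arrive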
We also considered using Dataflow with Bigtable CDC (Change Data Capture), which can be a good option if you don't control the update source. In our case, however, Dataflow was significantly more expensive (about 10x), less stable, and CDC increased Bigtable resource usage by around 20%, so we stuck with our custom solution.
In short: if you need to filter a Bigtable external table in BigQuery using another table, you're likely better off syncing the data into BigQuery instead. External tables just aren’t optimized for this kind of use.
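For reference, once tt is mirrored into a native table, the original query pattern from the question works as written. A quick sketch with the Python client (table names again hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Both tables are now native, so BigQuery can execute the join efficiently
# instead of fully scanning an external Bigtable table.
sql = """
    SELECT t.*
    FROM `my_dataset.tt_native` AS t
    JOIN `my_dataset.qt` AS q
      ON t.rowKey = q.rowKey
"""
for row in client.query(sql).result():
    print(dict(row))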