sqlgoogle-bigquerydatabase-cluster

BigQuery – Is there a limited number of rows a clustered field can scan through when applying WHERE shop IN (select shop from X)?


As an example:

Say I've taken table_1; partitioned it by day and clustered it by shop and I've tried using the list of shop in table_2 (not clustered or partitional) to filter by from my clustered field in table_1.

Then, when I query something like:

SELECT shop, COUNT(id) count_id
FROM table_1
WHERE shop IN (SELECT shop FROM table_2 ORDER BY shop)
GROUP BY shop

I get the full upper bound of 8.9GB processed.

My processing time should be much lower since; table_1 contains say 10m rows of shop names and table_2 contains only 2m of those rows.

Note that when I query something like:

SELECT shop, COUNT(id) count_id
FROM table_1
WHERE shop IN (SELECT shop FROM table_2 ORDER BY shop LIMIT 10000)
GROUP BY shop

I process like 230MB?

In the second query; I'm limiting my rows to 10k, which seems to work but if table_1 contains 10m rows and table_1 contains 2m rows, then that's already me narrowing it down to 2m rows max right?

My main quest here is: Is there a certain threshold for how many my clustered field can scan through without processing the entire upper bound?

Hope that all made sense!! If not, please do let me know and I'll try to re-explain.

Thank you.


Solution

  • The difference lies here:

    WHERE shop IN (SELECT shop FROM table_2 ORDER BY shop LIMIT 1000)

    Usually you don't need ORDER BY for a IN() operation - but when you do IN (ORDER BY x LIMIT y) - then you are only selecting the first 1000 shops, all with id's between - let's say - a and d.

    Clustering has sorted your main table contents by shop - hence this IN will only open the clusters that contain shops between a and d. Cost savings!

    But if you did something like:

    WHERE shop IN('a','b','c','d',....,'z')

    there would be almost no cost savings, because every cluster of data needs to be opened to find in any of all these clusters to see if there's a shop with that name.

    Check the differences between these 3 queries - and note that DISTINCT is very important too:

    SELECT *
    FROM `fh-bigquery.weather_gsod.all`
    WHERE name IN
      ('A', 'C', 'E', 'H', 'J', 'L', 'M', 'O', 'R', 'S') 
    LIMIT 10
    
    # 0.5 sec elapsed, 2.6 GB processed
    
    SELECT *
    FROM `fh-bigquery.weather_gsod.all`
    WHERE name IN
       (SELECT name FROM `fh-bigquery.weather_gsod.all` WHERE name BETWEEN 'A' AND 'C')
    LIMIT 10
    
    # 22.5 sec elapsed, 45.6 GB processed
    
    SELECT *
    FROM `fh-bigquery.weather_gsod.all`
    WHERE name IN
       (SELECT DISTINCT name FROM `fh-bigquery.weather_gsod.all` WHERE name BETWEEN 'A' AND 'C')
    LIMIT 10
    
    # 3.2 sec elapsed, 558.7 MB processed