google-bigquerygoogle-data-catalog

Why do BigQuery policy tagged columns prevent access to other columns when used in a GROUP BY statement?


We are exploring the use of BigQuery policy tags for implementing column level security. We have discovered that when a column that a user does not have access to is included in a GROUP BY clause then the user cannot access other columns in the GROUP BY clause either.

Let me explain with a contrived example. Imagine we have a table t with two columns:

These columns have policy tags on them which grant our end users permission to view the data in t.a but not the data in t.b.

Hence our end users can issue this query:

select a 
from t

but when issuing this query:

select b 
from t

they get an error:

Access Denied: BigQuery BigQuery: User does not have permission to access policy tag "policy-tag-name" on column project.dataset.t.b.

This is expected and is by design.

However if a user issues this query:

select a 
from (
    select a, b 
    from t 
    group by a, b
)

then the same error occurs.

Why is this? The user is not accessing any data that they are not allowed to access. I am struggling to comprehend why the group by statement causes this error. Can anyone enlighten me?


Solution

  • This query does not expose values b, but it would expose some information about b if it were allowed.

    E.g. it lets one know how many different values of b each value of a has. Say a is a product id, and b is seller id. Query would show how many separate sellers each product has. But whoever decided to hide sellers probably does not want this. Or a is some employee group (e.g. department or manager id), and b is salary. This query would expose whether they have the same salary. Policy prevents this.

    Note that there could be more complex queries where referencing b does not expose anything about b, but policy prevents it anyway. It is also by design - BigQuery plays it safe, and if it cannot prove such reference does not expose any information, it prohibits it.