I was trying to automatically tag infoTypes like PhoneNumber and EmailId on data in a GCS bucket and BigQuery external tables using the Data Loss Prevention (DLP) tool in GCP, so that those tags show up in Data Catalog and subsequently in Dataplex. Now the problems are:
The question is: is my understanding correct that the GCP DLP - Data Catalog integration works only for internal BigQuery tables? Or am I doing something wrong here? The GCP documentation doesn't mention these things either!
Also, while configuring the inspection job from the DLP UI console, I had to provide a BigQuery table ID mandatorily. Is there a way I can run a DLP inspection job against a BQ dataset or a bunch of tables?
Regarding Data Loss Prevention services in Google Cloud, your understanding is correct: data cannot be exfiltrated by copying it to services outside the perimeter, e.g., a public Google Cloud Storage (GCS) bucket or an external BigQuery table. Visit this URL for more reference.
Now, about how to run a DLP inspection job against a bunch of BQ tables, there are two ways to do it. The first is to enumerate the tables through the BigQuery JDBC driver and inspect their contents with the DLP streaming API:
// Requires the Simba BigQuery JDBC driver on the classpath.
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import javax.sql.DataSource;

String url = String.format(
    "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=3;ProjectId=%s;",
    projectId);
DataSource ds = new com.simba.googlebigquery.jdbc42.DataSource();
ds.setURL(url);
Connection conn = ds.getConnection();
DatabaseMetaData databaseMetadata = conn.getMetaData();
// List every table visible through this connection.
ResultSet tablesResultSet =
    databaseMetadata.getTables(conn.getCatalog(), null, "%", new String[] {"TABLE"});
while (tablesResultSet.next()) {
  // Query your table data and call the DLP streaming (content.inspect) API.
}
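The comment inside the loop is where the streaming call happens. As a minimal sketch (in Python for brevity; the project ID, headers, rows, and infoTypes below are placeholder assumptions, and the real values would come from your JDBC result set), the content.inspect request for one batch of rows can be assembled as plain dicts, which the DLP client accepts:

```python
def build_inspect_content_request(project_id, headers, rows):
    """Build a DLP content.inspect request for one batch of row values.

    The returned dict can be passed as
    dlp_v2.DlpServiceClient().inspect_content(request=...).
    """
    return {
        "parent": f"projects/{project_id}/locations/global",
        "inspect_config": {
            # infoTypes to scan for; adjust to your needs.
            "info_types": [{"name": "PHONE_NUMBER"}, {"name": "EMAIL_ADDRESS"}],
        },
        "item": {
            "table": {
                "headers": [{"name": h} for h in headers],
                "rows": [
                    {"values": [{"string_value": str(v)} for v in row]}
                    for row in rows
                ],
            }
        },
    }

# Usage (with google-cloud-dlp installed and credentials configured):
# from google.cloud import dlp_v2
# dlp = dlp_v2.DlpServiceClient()
# response = dlp.inspect_content(
#     request=build_inspect_content_request(
#         "my-project", ["email"], [["alice@example.com"]]))
```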
Here is a tutorial for this method. The second way is to list every dataset and table with the BigQuery client library and create a DLP inspection job per table:
from google.cloud import bigquery

client = bigquery.Client()
datasets = list(client.list_datasets(project=project_id))
if datasets:
    for dataset in datasets:
        tables = client.list_tables(dataset.dataset_id)
        for table in tables:
            # Create an inspection job for table.table_id here.
            pass
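The loop body above is where each per-table job gets created. A minimal sketch of that request, assuming the google-cloud-dlp library and placeholder project/dataset/table names (the DLP client accepts the request as plain dicts):

```python
def build_inspect_job_request(project_id, dataset_id, table_id):
    """Build a DLP inspect-job request for one BigQuery table.

    The returned dict can be passed as
    dlp_v2.DlpServiceClient().create_dlp_job(request=...).
    """
    return {
        "parent": f"projects/{project_id}/locations/global",
        "inspect_job": {
            # Point the job at the BigQuery table to scan.
            "storage_config": {
                "big_query_options": {
                    "table_reference": {
                        "project_id": project_id,
                        "dataset_id": dataset_id,
                        "table_id": table_id,
                    }
                }
            },
            # infoTypes to scan for; adjust to your needs.
            "inspect_config": {
                "info_types": [
                    {"name": "PHONE_NUMBER"},
                    {"name": "EMAIL_ADDRESS"},
                ],
            },
        },
    }

# Usage inside the loop above (with google-cloud-dlp installed):
# from google.cloud import dlp_v2
# dlp = dlp_v2.DlpServiceClient()
# dlp.create_dlp_job(request=build_inspect_job_request(
#     project_id, dataset.dataset_id, table.table_id))
```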
Use this thread for more reference on running a DLP inspection job against a bunch of BQ tables.