google-cloud-storageparquetduckdb

DuckDB persist secrets across sessions or provide an alternative config file for GCS credentials


I want to read Parquet files stored in a GCS bucket via DuckDB as CLI i.e. duckdb in an environment where I setup a Service Account and I created the HMAC credentials like

gcloud storage hmac create \
  my-sa@my-project.iam.gserviceaccount.com \
  --project=my-project > `pwd`/sensitive/hmac.yaml

I cannot find a way to persist the GCS credentials across duckdb sessions. I am using a recent version 1.2.2 of the duckdb CLI.

I setup the duckdb_wrapper below, which works, but it's a bit ugly and convoluted. Is there a way to store somewhere some .config or .duckdb config file with this sensitive information. Or perhaps some environment variables containing these secrets?

# setup the secret to be able to access the GCS bucket with the HMAC credentials
cat <<EOL > "/sensitive/duckdb_init.sql"
CREATE SECRET (
    TYPE gcs,
    KEY_ID '${GCS_ACCESS_KEY_ID}',
    SECRET '${GCS_SECRET_ACCESS_KEY}'
);
EOL
duckdb -c ".read /sensitive/duckdb_init.sql"

# https://duckdb.org/docs/stable/guides/network_cloud_storage/gcs_import.html
duckdb -s "
INSTALL httpfs;
LOAD httpfs;
" || exit 1

# DuckDB wrapper script in /usr/local/bin
cat <<EOF > /usr/local/bin/duckdb_wrapper
#!/bin/bash

# run the original duckdb command with any arguments provided
exec duckdb -c ".read /sensitive/duckdb_init.sql" "\$@"
EOF
chmod +x /usr/local/bin/duckdb_wrapper

After doing the above, I am able to invoke duckdb in both ways:

duckdb -c '.read /sensitive/duckdb_init.sql' \
  -s "SELECT * FROM read_parquet('gs://my-bucket/date=2018-03-03/*.parquet');"
duckdb_wrapper \
  -s "SELECT * FROM read_parquet('gs://my-bucket/date=2018-03-03/*.parquet');"

but as mentioned before it looks ugly and convoluted, is there anything better and cleaner I could do?


Solution

  • The ~/.duckdbrc file described here https://duckdb.org/docs/stable/operations_manual/footprint_of_duckdb/files_created_by_duckdb#global-files-and-directories worked out pretty well, here's an example configuration setup that is executed at each DuckDB session init when the CLI duckdb command is invoked:

    cat <<EOL > ~/.duckdbrc
    -- setup the secret to be able to access the GCS bucket with the HMAC credentials
    -- https://duckdb.org/docs/stable/operations_manual/footprint_of_duckdb/files_created_by_duckdb#global-files-and-directories
    --
    -- these environment variables must be present: 
    -- GCS_ACCESS_KEY_ID and GCS_SECRET_ACCESS_KEY
    CREATE SECRET (
        TYPE gcs,
        KEY_ID '${GCS_ACCESS_KEY_ID}',
        SECRET '${GCS_SECRET_ACCESS_KEY}'
    );
    -- parquet and GCS buckets setup
    -- https://duckdb.org/docs/stable/guides/network_cloud_storage/gcs_import.html
    INSTALL httpfs;
    LOAD httpfs;
    
    SELECT 'My DuckDB session initialized!' AS my_message;
    EOL