Using Microsoft's Open Datasets (here), I'd like to create external tables in my Databricks environment, available for consumption both in the Databricks SQL environment and from external BI tools, pointing at this Parquet source.
I'm a bit confused about the right approach. Here's what I've tried.
Approach 1: I tried to create a mount point (/mnt/taxiData) to the open/public Azure store, from which I'd use the normal CREATE TABLE dw.table USING PARQUET LOCATION '/mnt/taxiData'
using the following Python code. However, I get an error: Storage Key is not a valid base64 encoded string.
Note: This Azure store is open and public. There is no key, and no dbutils.secrets.get call should be required.
storageAccountName = "azureopendatastorage"
storageAccountAccessKey = r""  # the container is public, so there is no access key
blobContainerName = "nyctlc"

dbutils.fs.mount(
    source = "wasbs://{}@{}.blob.core.windows.net".format(blobContainerName, storageAccountName),
    mount_point = "/mnt/taxiData",
    extra_configs = {'fs.azure.account.key.' + storageAccountName + '.blob.core.windows.net': storageAccountAccessKey}
)
Approach 2: Use CREATE TABLE dw.table USING PARQUET LOCATION 'wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/'
since the wasbs location is open/public.
Result: I get no error, but I end up with an empty external table that has the correct schema and no data.
Approach 3: If I create a dataframe (df = spark.read.parquet('wasbs://...')), it works fine; I get a DataFrame with data. However, if I then add df.write.saveAsTable('dw.table'), it pulls the data from the open store over to my /dbfs store, which is not what I want.
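For reference, a sketch of what Approach 3 looks like (assuming the same yellow-taxi path as in Approach 2):

# Reading directly from the public container works without any credentials
df = spark.read.parquet("wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/")

# But saveAsTable creates a managed table, copying the Parquet data into my /dbfs store
df.write.saveAsTable("dw.table")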
Best approach & fix?
For Approach 1, I think the check inside dbutils.fs.mount is too strict - it makes sense to report this as an error to Azure support.
For Approach 2 - it's not enough to create the table; you also need to discover its partitions (Parquet isn't Delta, where partitions are discovered automatically). You can do that with the MSCK REPAIR TABLE SQL command, like this:
MSCK REPAIR TABLE dw.table
and after it's finished you'll see the data.
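Putting both statements together in a Python notebook cell (a sketch via spark.sql, using the database/table name and path from the question):

# Create an external table over the public Parquet files (no data is copied)
spark.sql("""
  CREATE TABLE IF NOT EXISTS dw.table
  USING PARQUET
  LOCATION 'wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/'
""")

# Register the partition directories in the metastore so queries return data
spark.sql("MSCK REPAIR TABLE dw.table")

# Sanity check
display(spark.sql("SELECT COUNT(*) FROM dw.table"))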