
Load data from public Azure blob in Matillion


I am going through the Matillion Academy course (Building a Data Warehouse). There is a slide deck to follow online, and I am running my own instance of Matillion to recreate the warehouse build. My Matillion is on Azure, as is my Snowflake database. The training is AWS-based, but it gives information about the adjustments needed for Azure or Google Cloud Storage.

One of the steps shows how to load data from blob storage, and it is S3-based. For Azure, different components need to be used (the S3 ones don't exist there), and the data needs to be loaded from Azure storage instead of S3. It also explains that for Snowflake on Azure yet another component needs to be used.

I have created a Stage in Snowflake:

    CREATE STAGE "onlinemtlntrainingazure_flights" 
    URL='azure://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights'

The stage shows up in Snowflake (as an external stage) and in Matillion (under 'Manage Stages' on the database). The code is taken from the JSON file I imported to create the job that does this (see the first step below).
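
For reference, the stage contents can also be checked directly from a Snowflake worksheet (a minimal check, using only the stage name created above; note the quotes are needed because the stage was created with a quoted lowercase identifier):

    -- List the files visible through the external stage; the three
    -- training_azure_flights_*.gz files should appear here if the
    -- public container is reachable.
    LIST @"onlinemtlntrainingazure_flights";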

I have created the target table in my database. It is accessible and visible in the Matillion IDE.
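
For illustration, this is the shape of that step; the column list below is a placeholder only, the real definition comes from the Create Table metadata attached to the training:

    -- Placeholder sketch only: substitute the actual column list from the
    -- training's Create Table metadata.
    CREATE OR REPLACE TABLE training_flights (
        flight_date DATE,
        carrier     VARCHAR,
        origin      VARCHAR,
        dest        VARCHAR
        -- ...remaining columns per the attached metadata
    );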

The adjusted component I am to use is 'Azure Blob Storage Load'. According to the documentation, I will need:

For Snowflake on Azure: Create a Stage in Snowflake:

You should create a Stage in Snowflake pointing to the public data we provide. Please find below the .json file containing the job that will help you do this. Don't forget to change the SQL script to point to your own schema.

After Creating the Stage in Snowflake:

You should use the 'Create Table' and the 'Azure Blob Storage Load' components individually, as the 'Azure Blob Load Generator' won't let you select the Stage previously created. We have attached below the Create Table metadata to save you some time.

'Azure Blob Storage Load' Settings:

Stage: onlinemtlntrainingazure_flights
Pattern: training_azure_flights_2016.gz
Target Table: training_flights
Record Delimiter: 0x0a
Skip Header: 1

The source data on Azure is located here:

Azure Blob Container (with flights data)

https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2016.gz

https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2017.gz

https://onlinemtlntrainingazure.blob.core.windows.net/online-mtln-training-azure-flights/training_azure_flights_2018.gz

Unfortunately, when I use these settings on the 'Azure Blob Storage Load' component, it fails with an error.
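
For reference, my understanding is that those settings should translate into roughly the following COPY INTO statement (this is my assumption of what the component generates, not SQL taken from the Matillion logs):

    -- Rough equivalent of the component settings above (assumed, not the
    -- exact SQL Matillion generates): gzipped CSV, newline record delimiter,
    -- one header row skipped.
    COPY INTO training_flights
    FROM @"onlinemtlntrainingazure_flights"
    PATTERN = '.*training_azure_flights_2016\.gz'
    FILE_FORMAT = (TYPE = CSV
                   RECORD_DELIMITER = '\n'
                   SKIP_HEADER = 1
                   COMPRESSION = GZIP);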

Any thoughts?

Edit: I found a workaround using the Data Transfer component, which first copies the files from the public https location to my own Azure blob storage, and then I process them further from there. But I would still like to know how to do it as suggested in the training, and why it fails.


Solution

  • The example files are in a storage account that your Azure Blob Storage Load Generator cannot read from. Instead of using a Snowflake Stage, you might find it easier to copy the files into a storage account that you do own, and then use the Azure Blob Storage Load Generator on the copied files.

    In a Matillion ETL instance on Azure, you can access files over https and copy them into your own storage account using a Data Transfer component.

    You already have the https:// source URLs for the three files, so each Data Transfer run just needs one of those URLs as its source and a container in your own storage account as its target.

    After running the Data Transfer three times, you will then be able to proceed with the Azure Blob Storage Load Generator, reading from your own copies of the files.
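
    If you would still rather follow the stage-based approach from the training, the same pattern should work against your own container once the files are copied; a minimal sketch, with placeholders for your storage account, container, and SAS token:

        -- Sketch only: replace the placeholders with your own storage account,
        -- container, and a SAS token that grants read access to the blobs.
        CREATE OR REPLACE STAGE my_flights_stage
          URL = 'azure://<your-account>.blob.core.windows.net/<your-container>'
          CREDENTIALS = (AZURE_SAS_TOKEN = '<your-sas-token>');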