sql-server, parquet

Could not load external library error when reading from OPENROWSET parquet file


I want to read a local parquet file, but when executing:

SELECT * FROM OPENROWSET(
    BULK 'H:\Linolium\somename.parquet', FORMAT = 'PARQUET') AS data;

I get:

Could not load external library 'E:\DB\MSSQL16.SQL2022\MSSQL\Binn\extractors\ParquetReaderExtractor.dll'.

SQL Server version is:

Microsoft SQL Server 2022 (RTM-CU18) (KB5050771) - 16.0.4185.3 (X64)

Is this not supported, is it a bug, or is my installation missing something? I don't have either PolyBase or the Azure extensions installed.

After installing PolyBase, the error I get is:

Connector prefix 'SomePath\small.parquet' is not supported for specified FORMAT 'PARQUET'.

Tried both UNC and drive paths, same error.


Solution

  • It seems it's not possible to set up PolyBase/SQL Server to read local parquet files.

    What I did eventually was to run local S3 storage using LocalStack S3.

    Just for posterity, here are the "basic" instructions on how to set it up with SQL Server running on your computer:

    1. Install WSL/WSL 2 and Docker Desktop.
    2. Set up LocalStack S3. I used the instructions in https://github.com/own3dh2so4/localstack_s3_example?tab=readme-ov-file to do it. The important thing if you run Windows is to change the playground/localstack/s3.sh script to use Linux line endings (\n), otherwise the script won't be run when LocalStack sets things up.
    3. The bucket name can be changed in the .env.localstack file, via the BUCKET_NAMES setting.
    4. After running Docker, you can upload the files into the bucket. I used something like the AWS CLI command sketched right after this list.
    5. Verify in your local browser: http://host.docker.internal:4566/mys3bucket. It should list the parquet file.
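
    To give an idea of step 4, here's a hedged sketch of the upload command (assuming the AWS CLI is installed and configured with LocalStack's default test/test credentials; the bucket name mys3bucket and the parquet file name are just the ones used elsewhere in this answer):

    # --endpoint-url points the AWS CLI at the local LocalStack instance instead of real AWS
    aws --endpoint-url=http://localhost:4566 s3 cp somename.parquet s3://mys3bucket/somename.parquet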

    Now, you're ready for the SQL Server part!

    For the S3 part you need a database scoped credential, which in turn requires a database master key. If you don't already have a database master key, create one with:

    USE yourDatabase;
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'sup3rs3cr3tpass';
    
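    If you're not sure whether the database already has a master key, a quick check (just a sketch; ##MS_DatabaseMasterKey## is the fixed name SQL Server uses for the database master key) is:

    -- returns a row only if a database master key already exists in this database
    SELECT name, create_date
    FROM sys.symmetric_keys
    WHERE name = '##MS_DatabaseMasterKey##';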

    Now, S3 access needs a credential and a data source.

    CREATE DATABASE SCOPED CREDENTIAL [s3_ds]
    WITH IDENTITY = 'S3 Access Key',
    SECRET = 'test:test';
    

    It's very important that IDENTITY = 'S3 Access Key'. The SECRET is aws_access_key_id:aws_secret_access_key from the \playground\aws\credentials file.

    Finally, you can create the data source. This took me a while to get working, because localhost doesn't work. The address that worked was:

    CREATE EXTERNAL DATA SOURCE s3_ds
    WITH
    (
        LOCATION = 's3://s3.localhost.localstack.cloud:4566' -- Not s3://localhost:4566!
    ,   CREDENTIAL = [s3_ds]
    )
    

    The basic explanation is that the s3.localhost.localstack.cloud:4566 endpoint allows "virtual host based" URL access, i.e. /bucket/file, while the other one (s3://localhost:4566) is path-based URL access, which won't work with SQL Server because it prefixes the URL wrongly.

    So now you can query your file:

    SELECT *
    FROM OPENROWSET
    (
        BULK '/your-bucket-name/yourfile.parquet'
    ,   FORMAT       = 'PARQUET'
    ,   DATA_SOURCE  = 's3_ds'
    ) cc
    

    If you get a "file cannot be accessed" or "does not exist" error, that means you messed something up (I did, many times).
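
    When that happens, a quick sanity check is to confirm that the credential and the data source really exist and point where you expect (just a sketch using standard catalog views; the names are the ones created above):

    -- the external data source and its LOCATION
    SELECT name, location FROM sys.external_data_sources;

    -- the database scoped credential and its IDENTITY
    SELECT name, credential_identity FROM sys.database_scoped_credentials;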