I want to read a local parquet file but when executing:
SELECT * FROM OPENROWSET(
BULK 'H:\Linolium\somename.parquet', FORMAT = 'PARQUET') AS data;
I get:
Could not load external library 'E:\DB\MSSQL16.SQL2022\MSSQL\Binn\extractors\ParquetReaderExtractor.dll'.
SQL Server version is:
Microsoft SQL Server 2022 (RTM-CU18) (KB5050771) - 16.0.4185.3 (X64)
Is this not supported, some bug or is my installation missing something? I don't have either Polybase or Azure extensions installed.
After installing polybase, the error I get is:
Connector prefix 'SomePath\small.parquet' is not supported for specified FORMAT 'PARQUET'.
Tried both unc and drive paths, same error
It seems it's not possible to setup Polybase/SQL Server to read local parquet files.
What I did eventually was to run a local s3 storage using localstack s3.
Just for posterity, here are the "basic" instructions on how to set it up on SQL Server running on your computer:
playground/localstack/s3.sh script to use Linux line ending \n, otherwise the script won't be run when localstack seta things up..env.localstack, BUCKET_NAMES setting.\playground\aws, this will enable it for docker access.aws s3 cp /root/.aws/yourfile.parquet s3://<yourbucket name> --checksum-algorithm SHA256 to upload the file to s3. Old guides omit checksum-algo parts, which might lead to error (invalid checksum algo).http://host.docker.internal:4566/mys3bucket. It should list the parquet fileNow, you're ready for the SQL Server part!
For s3 stuff one needs credentials, which needs a database master key. If you don't already have a database master key, create one with:
USE yourDatabase
CREATE MASTER KEY ENCRYPTION BY PASSWORD ='sup3rs3cr3tpass'
Now, s3 needs a credential and a data source.
CREATE DATABASE SCOPED CREDENTIAL [s3_ds]
WITH IDENTITY = 'S3 Access Key',
SECRET = 'test:test';
It's very important that identity = S3 Access Key. The secret is aws_access_key_id:aws_secret_access_key from the \playground\aws\credentials file.
Finally, you can create the Data source. This took me a while to get working, because localhost doesn't work. The address that worked was:
CREATE EXTERNAL DATA SOURCE s3_ds
WITH
(
LOCATION = 's3://s3.localhost.localstack.cloud:4566' -- Not s3://localhost:4566!
, CREDENTIAL = [s3_ds]
)
The basic explanation is that s3.localhost.localstack.cloud:4566 moniker allows the "virtual host based" URL access ie: /bucket/file, while the other one (localhost...) is Path based URL access, which won't work with SQL Server, because it prefixes the URL wrongly.
So, now, you can query your file:
SELECT *
FROM OPENROWSET
(
BULK '/your-bucket-name/yourfile.parquet'
, FORMAT = 'PARQUET'
, DATA_SOURCE = 's3_ds'
) cc
If you get file cannot be accessed or not exist, that means you messed something up (I did many times).