[SOLVED] Read_parquet function of duckdb from MinIO issue

I am trying to query a Parquet file with DuckDB. The file is stored in MinIO, and I am working in a Jupyter notebook. The code is below:

import duckdb

def queryduckdb(bucketname, parquetfilepath):

    try:

        # Establish a connection
        conn = duckdb.connect()
        
        # Load the httpfs extension
        conn.execute("LOAD httpfs")
        
        # Set MinIO configuration
        conn.execute("SET s3_region = 'ap-south-1'")
        conn.execute("SET s3_access_key_id = 'abcded'")
        conn.execute("SET s3_secret_access_key = 'abcdd'")
        conn.execute("SET s3_endpoint = 'http://172.20.20.101:9000'")
        conn.execute("SET s3_use_ssl = false")  # Use true if MinIO uses HTTPS
        
        # Construct the URL
        url = f's3://{bucketname}/{parquetfilepath}'
       
        # Construct the query
        query = f"SELECT * FROM read_parquet('{url}')"
               
        # Execute the query and fetch results
        result = conn.execute(query).fetchall()

        return result

    except Exception as e:
        # Print or log the exception message
        print(f"Exception: {e}")

    finally:
        # Close the connection
        conn.close()

bucketname="bucketname"
parquetfile = "sData/MarketDetails/Year=2024/1.parquet"
queryduckdb(bucketname,parquetfile)

The constructed URL is s3://bucketname/sData/MarketDetails/Year=2024/1.parquet

But I get the following error:

IO Error: Connection Error for HTTP Head to 'http://bucketname.http://172.20.20.101%3A9000/sData/MarketDetails/Year=2024/1.parquet'

Why are there two "http" in the error? My concern is that the error shows bucketname.http://endpoint/parquetfile. Why does the bucket name come first, followed by the endpoint? And why are the bucket name and the Parquet file path separated?

Please advise.

Solution

Why are there two "http" in the error?

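Because you specified "http://" in the endpoint. s3_endpoint expects only the host and port; DuckDB adds the scheme itself (based on s3_use_ssl) and, with the default URL style, prepends the bucket name to the endpoint. That is how the request URL ended up starting with http://bucketname.http://. You should instead be using: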

conn.execute("SET s3_endpoint = '172.20.20.101:9000'")
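If the endpoint arrives as a full URL (for example from a config file), one way to avoid this mistake is to split it before handing it to DuckDB. This is a minimal sketch using only the standard library; split_endpoint is a hypothetical helper name, not part of DuckDB, and it assumes an endpoint without a scheme should be treated as plain HTTP.

```python
from urllib.parse import urlsplit


def split_endpoint(endpoint: str) -> tuple[str, bool]:
    """Return (host[:port], use_ssl) for an endpoint given with or without a scheme."""
    # urlsplit only recognises the host when a scheme or "//" prefix is present,
    # so add "//" for bare values such as "172.20.20.101:9000".
    parts = urlsplit(endpoint if "://" in endpoint else f"//{endpoint}")
    # Assumption: no scheme means plain HTTP.
    use_ssl = parts.scheme == "https"
    return parts.netloc, use_ssl


host, use_ssl = split_endpoint("http://172.20.20.101:9000")
print(host, use_ssl)  # 172.20.20.101:9000 False
```

The two values can then be used for s3_endpoint and s3_use_ssl respectively.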

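You may also wish to switch to path-style URLs. With the default virtual-hosted style, the bucket name is prepended to the endpoint as a subdomain (bucketname.172.20.20.101:9000), which generally cannot resolve when the endpoint is an IP address. Path style puts the bucket in the URL path instead: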

conn.execute("SET s3_url_style = 'path'")
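To see why both settings matter, the sketch below imitates how the request URL is assembled in the two styles. It is an illustration inferred from the error message in the question, not DuckDB's actual code; build_url is a hypothetical name, and unlike the real error it does not percent-encode the colon.

```python
def build_url(endpoint: str, bucket: str, key: str,
              url_style: str = "vhost", use_ssl: bool = False) -> str:
    """Illustrative model of S3 request URLs; not DuckDB's implementation."""
    scheme = "https" if use_ssl else "http"
    if url_style == "path":
        # Path style: the bucket is the first path segment.
        return f"{scheme}://{endpoint}/{bucket}/{key}"
    # Virtual-hosted style: the bucket becomes a subdomain of the endpoint.
    return f"{scheme}://{bucket}.{endpoint}/{key}"


key = "sData/MarketDetails/Year=2024/1.parquet"

# Endpoint with a scheme, default style: reproduces the doubled "http://".
print(build_url("http://172.20.20.101:9000", "bucketname", key))

# Bare endpoint with path style: the form a MinIO server at an IP address can serve.
print(build_url("172.20.20.101:9000", "bucketname", key, url_style="path"))
```

The first call yields a URL beginning with http://bucketname.http://, matching the shape of the reported error; the second places the bucket after the host and port.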