parquet minio duckdb

DuckDB read_parquet fails when reading a Parquet file from MinIO


I am trying to query a Parquet file with DuckDB. The file is stored in MinIO, and I am running the code in a Jupyter notebook. The code is below:

import duckdb

def queryduckdb(bucketname, parquetfilepath):

    try:

        # Establish a connection
        conn = duckdb.connect()
        
        # Load the httpfs extension
        conn.execute("LOAD httpfs")
        
        # Set MinIO configuration
        conn.execute("SET s3_region = 'ap-south-1'")
        conn.execute("SET s3_access_key_id = 'abcded'")
        conn.execute("SET s3_secret_access_key = 'abcdd'")
        conn.execute("SET s3_endpoint = 'http://172.20.20.101:9000'")
        conn.execute("SET s3_use_ssl = false")  # Use true if MinIO uses HTTPS
        
        # Construct the S3 URL
        url = f's3://{bucketname}/{parquetfilepath}'
       
        # Construct the query
        query = f"SELECT * FROM read_parquet('{url}')"
               
        # Execute the query and fetch results
        result = conn.execute(query).fetchall()

        return result

    except Exception as e:
        # Print or log the exception message
        print(f"Exception: {e}")

    finally:
        # Close the connection
        conn.close()

bucketname="bucketname"
parquetfile = "sData/MarketDetails/Year=2024/1.parquet"
queryduckdb(bucketname,parquetfile)

The constructed URL is s3://bucketname/sData/MarketDetails/Year=2024/1.parquet

But I get the following error:

IO Error: Connection Error for HTTP Head to 'http://bucketname.http://172.20.20.101%3A9000/sData/MarketDetails/Year=2024/1.parquet'

Why are there two "http" in the error? My main concern is that the error shows bucketname.http://endpoint/parquetfile. Why does the bucket name come first, followed by the endpoint? And why are the bucket name and the Parquet file path separated?

Kindly guide


Solution

  • Why are there two "http" in the error?

    Because the endpoint value includes the "http://" scheme. DuckDB adds the scheme itself (http or https, depending on s3_use_ssl), so s3_endpoint should contain only the host and port:

    conn.execute("SET s3_endpoint = '172.20.20.101:9000'")
    

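    The bucket name comes first because DuckDB defaults to virtual-hosted-style URLs (s3_url_style = 'vhost'), which place the bucket in front of the endpoint host as a subdomain. A MinIO server addressed by IP and port generally needs path-style URLs, where the bucket is the first path segment, so you may also wish to use: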

    conn.execute("SET s3_url_style = 'path'")
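
    To make the two URL styles concrete, here is a minimal sketch of how the request URL is assembled in each style. It is only an illustrative model, not DuckDB's actual implementation; the helper names normalize_endpoint and build_url are hypothetical, and the real error message additionally percent-encodes the colon (%3A).

```python
from urllib.parse import urlsplit


def normalize_endpoint(endpoint):
    """Hypothetical helper: split an endpoint into (host:port, use_ssl).

    Accepts values with or without a scheme, so 'http://host:9000'
    and 'host:9000' both yield a bare 'host:9000'.
    """
    if "://" in endpoint:
        parts = urlsplit(endpoint)
        return parts.netloc, parts.scheme == "https"
    # Assumption for this sketch: no scheme means plain HTTP.
    return endpoint, False


def build_url(endpoint, bucket, key, use_ssl=False, url_style="vhost"):
    """Illustrative model of vhost-style vs path-style S3 URLs."""
    scheme = "https" if use_ssl else "http"
    if url_style == "path":
        # Path style: the bucket is the first path segment.
        return f"{scheme}://{endpoint}/{bucket}/{key}"
    # Virtual-hosted style: the bucket is prefixed to the host.
    return f"{scheme}://{bucket}.{endpoint}/{key}"


key = "sData/MarketDetails/Year=2024/1.parquet"

# Endpoint with a scheme plus vhost style reproduces the doubled "http".
print(build_url("http://172.20.20.101:9000", "bucketname", key))

# Bare host:port plus path style gives the URL a MinIO server expects.
host, use_ssl = normalize_endpoint("http://172.20.20.101:9000")
print(build_url(host, "bucketname", key, use_ssl=use_ssl, url_style="path"))
```

    The first call shows the same shape as the failing request (bucket name, then the full endpoint string including its scheme), while the second shows the bucket and the file path together after the host.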