amazon-s3, hadoop, hive, apache-iceberg, metastore

Standalone Hive metastore with Iceberg and S3


I'd like to use Presto to query Iceberg tables stored in S3 as Parquet files, so I need to use the Hive metastore. I'm running a standalone Hive metastore service backed by MySQL, and I've configured Iceberg to use the Hive catalog:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.hive.HiveCatalog;

public class MetastoreTest {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("hive.metastore.uris", "thrift://x.x.x.x:9083");
        conf.set("hive.metastore.warehouse.dir", "s3://bucket/warehouse");
        HiveCatalog catalog = new HiveCatalog(conf);
        catalog.createNamespace(Namespace.of("my_metastore"));
    }

}

I'm getting the following error:

    Caused by: MetaException(message:Got exception: org.apache.hadoop.fs.UnsupportedFileSystemException No FileSystem for scheme "s3")

I've included /hadoop-3.3.0/share/hadoop/tools/lib in HADOOP_CLASSPATH and also copied the AWS-related jars into apache-hive-metastore-3.0.0-bin/lib. What else is missing?


Solution

  • Finally figured this out. First (as already mentioned in the question) I had to include hadoop/share/hadoop/tools/lib in HADOOP_CLASSPATH. However, with Hadoop 3.3.0, neither modifying HADOOP_CLASSPATH nor copying particular jars from tools to common worked for me; after switching to hadoop-2.7.7 it worked. I also had to copy the Jackson-related jars from tools to common. My hadoop/etc/hadoop/core-site.xml looks like this:

    <configuration>
    
        <!-- fs.default.name is the deprecated alias of fs.defaultFS; both work -->
        <property>
            <name>fs.default.name</name>
            <value>s3a://{bucket_name}</value>
        </property>
    
    
        <property>
            <name>fs.s3a.impl</name>
            <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
        </property>
    
        <property>
            <name>fs.s3a.endpoint</name>
            <value>{s3_endpoint}</value>
            <description>AWS S3 endpoint to connect to. An up-to-date list is
                provided in the AWS Documentation: regions and endpoints. Without this
                property, the standard region (s3.amazonaws.com) is assumed.
            </description>
        </property>
    
    
        <property>
            <name>fs.s3a.access.key</name>
            <value>{access_key}</value>
        </property>
    
        <property>
            <name>fs.s3a.secret.key</name>
            <value>{secret_key}</value>
        </property>
    
    
    </configuration>
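The same s3a properties can also be supplied programmatically, on the Configuration handed to HiveCatalog, instead of (or in addition to) core-site.xml. A minimal sketch, following the constructor used in the question, assuming hadoop-aws and its AWS SDK dependency are on the classpath; the endpoint and credential values are placeholders, and note that the metastore service resolves the warehouse path itself, so the server's own core-site.xml still matters:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.hive.HiveCatalog;

public class MetastoreTest {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("hive.metastore.uris", "thrift://x.x.x.x:9083");
        // Use an s3a:// path so the S3A connector handles it,
        // avoiding the No FileSystem for scheme "s3" error.
        conf.set("hive.metastore.warehouse.dir", "s3a://{bucket_name}/warehouse");
        // Same properties as in core-site.xml above; values are placeholders.
        conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        conf.set("fs.s3a.endpoint", "{s3_endpoint}");
        conf.set("fs.s3a.access.key", "{access_key}");
        conf.set("fs.s3a.secret.key", "{secret_key}");
        HiveCatalog catalog = new HiveCatalog(conf);
        catalog.createNamespace(Namespace.of("my_metastore"));
    }

}
```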
    

    At this point, you should be able to list your S3 bucket:

    hadoop fs -ls s3a://{bucket}/
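The same check can be done from Java with the Hadoop FileSystem API, which is useful for verifying that the client side of the setup works before involving the metastore. A hedged sketch (the bucket name is a placeholder; it assumes the core-site.xml above is on the classpath, or that the same fs.s3a.* properties have been set on the Configuration):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ListCheck {

    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml (and the fs.s3a.* settings) from the classpath.
        Configuration conf = new Configuration();
        // Equivalent of: hadoop fs -ls s3a://{bucket}/
        FileSystem fs = FileSystem.get(URI.create("s3a://{bucket}/"), conf);
        for (FileStatus status : fs.listStatus(new Path("s3a://{bucket}/"))) {
            System.out.println(status.getPath());
        }
    }

}
```

This requires a live S3 endpoint and valid credentials, so it is a manual smoke test rather than something to run in CI.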