amazon-web-services apache-spark aws-glue amazon-emr apache-iceberg

Apache Iceberg tables not working with AWS Glue in AWS EMR


I'm trying to load a table in a Spark EMR cluster from the Glue catalog. The table is in Apache Iceberg format and is stored in S3. It was created correctly, because I can query it from AWS Athena. When creating the cluster I set this configuration:

[{"classification":"iceberg-defaults","properties":{"iceberg.enabled":"true"}}]

I have tried running SQL queries from Spark against tables in other formats (CSV) and they work, but when I try to read Iceberg tables I get this error:

org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table table_name. StorageDescriptor#InputFormat cannot be null for table: table_name(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)

This is the code in the notebook:

%%configure -f
{
"conf":{
    "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.dev":"org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.dev.type":"hadoop",
    "spark.sql.catalog.dev.warehouse":"s3://pyramid-streetfiles-sbx/iceberg_test/"
    }
}

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t

spark = SparkSession.builder.getOrCreate()

# This query works and lists the Iceberg table I want to read
spark.sql("show tables from iceberg_test").show(truncate=False)

# This query raises the error
spark.sql("select * from iceberg_test.table_name limit 10").show(truncate=False)

How can I read Apache Iceberg tables in an EMR cluster with Spark and the Glue catalog?


Solution

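  • You need to qualify the table with the name of the Iceberg catalog that is backed by Glue. Without a catalog prefix, Spark resolves iceberg_test.table_name through its default session catalog, which appears to be what produces the StorageDescriptor#InputFormat error. Note also that the dev catalog in the question is configured with type hadoop, so it does not read from Glue.

    Example: glue_catalog.<your_database_name>.<your_table_name>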

    https://docs.aws.amazon.com/pt_br/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
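
A minimal sketch of what that could look like for the notebook in the question. The catalog name glue_catalog comes from the example above, and the warehouse path and table name are reused from the question. The helper names glue_iceberg_conf and qualified_name are hypothetical, and the property keys follow the Iceberg AWS integration as I understand it, so check them against the Iceberg version shipped with your EMR release.

```python
import json


def glue_iceberg_conf(catalog: str, warehouse: str) -> dict:
    """Spark properties that register an Iceberg catalog backed by AWS Glue.

    Hypothetical helper: it only builds the dictionary, it does not touch Spark.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # The catalog itself is an Iceberg SparkCatalog ...
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # ... whose metadata lives in Glue instead of a Hadoop warehouse directory.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def qualified_name(catalog: str, database: str, table: str) -> str:
    """Build the three-part name <catalog>.<database>.<table>."""
    return f"{catalog}.{database}.{table}"


# Assumption: same bucket, database and table as in the question.
conf = glue_iceberg_conf("glue_catalog", "s3://pyramid-streetfiles-sbx/iceberg_test/")

# JSON body to paste under "%%configure -f" in the notebook.
print(json.dumps({"conf": conf}, indent=4))

query = f"select * from {qualified_name('glue_catalog', 'iceberg_test', 'table_name')} limit 10"
print(query)

# In the notebook, once the session has been (re)started with the conf above:
# spark.sql(query).show(truncate=False)
```

The printed JSON replaces the body of the %%configure -f cell from the question, and the query differs from the failing one only in the glue_catalog prefix.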