I'm trying to read data from an Iceberg table. The data is in ORC format and partitioned by column. I'm getting this error:
AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table temp_tag_thrshld_iceberg. StorageDescriptor#InputFormat cannot be null for table: temp_tag_thrshld_iceberg (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
This is my code:
spark = SparkSession.builder.config("spark.driver.memory", "25g").appName(app_name).getOrCreate()
temp_tag_thrshld_data = spark.sql("SELECT * FROM dev_db.temp_tag_thrshld_iceberg")
If I replace the query with spark.sql("SELECT * FROM a_normal_athena_table"), the code runs fine. I'm also not able to read the data directly from S3: it's in ORC format with Snappy compression, and I don't get any results (I'm probably missing the correct framework to read ORC from S3 directly, but that's another issue for another day).
I've tried validating my table using
aws glue get-table --database-name dev_db --name temp_tag_thrshld_iceberg
and this is the output I got -
{
  "Table": {
    "Name": "temp_tag_thrshld_iceberg",
    "DatabaseName": "dev_db",
    "CreateTime": 1658864256.0,
    "UpdateTime": 1658864347.0,
    "Retention": 0,
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "tag",
          "Type": "int",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "1",
            "iceberg.field.optional": "true"
          }
        },
        {
          "Name": "zipcode",
          "Type": "int",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "2",
            "iceberg.field.optional": "true"
          }
        },
        {
          "Name": "threshold_max",
          "Type": "double",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "3",
            "iceberg.field.optional": "true"
          }
        },
        {
          "Name": "level",
          "Type": "string",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "4",
            "iceberg.field.optional": "true"
          }
        }
      ],
      "Location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg",
      "Compressed": false,
      "NumberOfBuckets": 0,
      "SortColumns": [],
      "StoredAsSubDirectories": false
    },
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {
      "metadata_location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg/metadata/00001-0ee5fbc7-044e-439d-aa1e-d76935002ebd.metadata.json",
      "previous_metadata_location": "s3://dev_db/athena-tables/temp_tag_thrshld_iceberg/metadata/00000-3a8f33f0-fbef-48c3-b289-6021f62b8b8c.metadata.json",
      "table_type": "ICEBERG"
    },
    "CreatedBy": "IAM Details",
    "IsRegisteredWithLakeFormation": false,
    "CatalogId": "571708111280",
    "VersionId": "1"
  }
}
Update: I changed the config to this (based on the Iceberg table configuration docs):
spark = (
    SparkSession.builder.config("spark.driver.memory", "25g")
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .appName(app_name)
    .getOrCreate()
)
Now I'm getting this new error:
An error occurred while calling o87.sql. Cannot find catalog plugin class for catalog 'spark_catalog': org.apache.iceberg.spark.SparkSessionCatalog
To read Iceberg tables in Glue you have to use the Apache Iceberg Connector for AWS Glue:
https://aws.amazon.com/marketplace/pp/prodview-iicxofvpqvsio
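The "Cannot find catalog plugin class" error usually means the Iceberg runtime classes are not on the Spark classpath, which is what the connector supplies. Once they are available, the session also needs a catalog that points at Glue rather than Hive. Here is a rough sketch of what that configuration might look like. The catalog name glue_catalog, the warehouse path, and the helper names iceberg_glue_conf and build_session are my own assumptions for illustration, not something taken from your job, so adjust them to your setup.

```python
def iceberg_glue_conf(catalog_name, warehouse):
    """Return Spark settings that register an Iceberg catalog backed by AWS Glue.

    catalog_name and warehouse are placeholders chosen by the caller.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enables Iceberg's SQL extensions in Spark.
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers a named Spark catalog implemented by Iceberg.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Tells Iceberg to look tables up in the Glue Data Catalog.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Uses Iceberg's S3 file IO for reading data and metadata files.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def build_session(app_name, conf):
    """Create a SparkSession with the given settings applied."""
    # Imported here so the config helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example usage (assumed names; requires the Iceberg jars on the classpath):
# conf = iceberg_glue_conf("glue_catalog", "s3://dev_db/athena-tables/")
# spark = build_session(app_name, conf)
# df = spark.sql("SELECT * FROM glue_catalog.dev_db.temp_tag_thrshld_iceberg")
```

Note that the table is then addressed with the catalog name as a prefix (glue_catalog.dev_db.temp_tag_thrshld_iceberg in this sketch), so Spark routes the lookup through Iceberg instead of the default Hive-style catalog that raised the InputFormat error.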
Below is a blog post for your reference that covers fetching data from Iceberg with AWS Glue in detail.