I am running a PySpark application on AWS EMR that is configured to use the AWS Glue Data Catalog as its metastore. I have a table in AWS Glue that points to a DynamoDB table, and I am trying to access that Glue table from my PySpark script. Running show tables works and lists the Glue table, but when I try to query the table I get the exception below:
pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: arn:aws:dynamodb:<region>:<acct_id>:table/DDBTABLE;'
My query in pyspark script:
spark.sql("select * from ddbtable").show()
I couldn't find any good reference on this. I see people discussing an issue with spark.sql.warehouse.dir, but I am not sure how that relates to the Glue Data Catalog. Any input?
I contacted AWS technical support, and apparently this is an issue with EMR (as of 5.23.0) when the Glue Data Catalog is used to access a Glue table that connects to DynamoDB. They are still working on it and in the meantime have provided the workaround below.
Edit the properties of the Glue table as follows:
Update: set the Location property to a dummy S3 location, so that it is of the form s3://dummy-path
Add: add the DynamoDB-specific information below under parameters,
"dynamodb.table.name": "ddb-table",
"dynamodb.column.mapping": "col:col",
"storage_handler": "org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler"
For how to update a Glue table, refer to the AWS Glue documentation.
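If you would rather apply the workaround from a script than from the console, something along these lines might work. This is only a sketch: it assumes boto3 is installed with credentials configured, that the three DynamoDB entries belong in the table-level Parameters map, and that it is enough to send back a conservative subset of the fields returned by get_table (the read-only ones have to be dropped before calling update_table). The helper names build_table_input and apply_workaround, and the database name in the usage line, are made up for illustration.

```python
import copy

# Conservative subset of the keys that Glue's UpdateTable accepts in TableInput.
# get_table also returns read-only keys (CreateTime, DatabaseName, ...) that
# must not be sent back, so anything not listed here is dropped.
ALLOWED_KEYS = {
    "Name", "Description", "Owner", "Retention",
    "StorageDescriptor", "PartitionKeys", "TableType", "Parameters",
}

STORAGE_HANDLER = "org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler"


def build_table_input(table, dummy_location, ddb_table_name, column_mapping):
    """Return a TableInput dict with the workaround applied (input is not modified)."""
    table_input = {
        key: copy.deepcopy(value)
        for key, value in table.items()
        if key in ALLOWED_KEYS
    }
    # Replace the DynamoDB ARN with a dummy S3 location.
    table_input.setdefault("StorageDescriptor", {})["Location"] = dummy_location
    # Assumption: the DynamoDB settings go in the table-level Parameters map.
    table_input.setdefault("Parameters", {}).update({
        "dynamodb.table.name": ddb_table_name,
        "dynamodb.column.mapping": column_mapping,
        "storage_handler": STORAGE_HANDLER,
    })
    return table_input


def apply_workaround(database, table_name, dummy_location,
                     ddb_table_name, column_mapping):
    """Fetch the Glue table, patch it, and write it back."""
    import boto3  # imported lazily so build_table_input works without boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    table_input = build_table_input(
        table, dummy_location, ddb_table_name, column_mapping
    )
    glue.update_table(DatabaseName=database, TableInput=table_input)
```

Usage would be something like apply_workaround("my_database", "ddbtable", "s3://dummy-path", "ddb-table", "col:col"), with the column mapping adjusted to the real columns of the table. It is worth printing the result of build_table_input and checking it before writing it back.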