Tags: amazon-redshift, amazon-redshift-spectrum, aws-lake-formation

Lake Formation: Redshift Spectrum can read the manifest file (for an external table) but not the actual data files?


I have an external table in Amazon Redshift:

create external table ruben.ruben_manifest_test
(
    customer_id bigint,
    external_cust_id varchar(30)
)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties('serialization.format'='1')
stored as
inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://mybucket/folder1/redshift_external_table_location.txt';

The location redshift_external_table_location.txt is a manifest file with the following content:

{
  "entries": [
    { "url": "s3://mybucket/folder2/file1.parquet",
      "meta": { "content_length": 8059990 } }
  ]
}

When I run select count(*) from ruben.ruben_manifest_test I get:

Spectrum Scan Error. Forbidden: HTTP response error code: 403 Message: AccessDenied Access Denied
x-amz-request-id: HBM2Q5DVA4WD7YJE
x-amz-id-2: OX8Z9+4NaKFLuDoQ7dWfAgtOy6KAN2ld+y2ksTo3Ela5xtgjonOQLEWdiZ5YuFq+PFNkSDfzOPs= (s3://mybucket/folder2/file1.parquet)

So Redshift is clearly able to read the manifest file (and obtain the S3 URI of the Parquet file), but it cannot read the Parquet file itself: Access Denied.

If I change the external table's location to point to the folder containing the Parquet file, with alter table ruben.ruben_manifest_test set location 's3://mybucket/folder2/', then the select works. So Redshift can read the Parquet files just fine when the location is a folder; for some reason it cannot when the files are listed in a manifest file.

But I do not understand which credentials are used to fetch the Parquet file, because the IAM roles associated with the Redshift cluster can definitely read those files. From the output of select * from svv_external_schemas, I understand that the IAM role for the external schema ruben is arn:aws:iam::xxxxxxx:role/mySpectrumRole, with a policy that allows glue:* and lakeformation:GetDataAccess. In AWS Lake Formation, mySpectrumRole has access to the data location s3://mybucket/*.

Any idea what the cause could be, or how to troubleshoot it? Specifically, is there any way to see which IAM role is used for the second S3 request (the one for the Parquet file)?
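For reference, this is the kind of query I ran to check the role attached to the external schema. The schema name is the one from my setup; esoptions is where the IAM_ROLE shows up, but as far as I can tell it does not reveal which credentials sign the individual S3 requests:

```sql
-- Inspect the options recorded for the external schema.
-- esoptions contains the IAM_ROLE used for catalog/Lake Formation access;
-- the per-file S3 credentials are not visible here.
select schemaname, esoptions
from svv_external_schemas
where schemaname = 'ruben';
```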

It seems to me that Redshift asks Lake Formation (GetDataAccess) for credentials and receives credentials scoped down to just the table location (which is only a manifest file), instead of credentials valid for both the manifest and the data files.


Solution

  • You can define databases and tables in the Glue Data Catalog on top of manifest files and then define an external schema in Redshift Spectrum on top of the Glue database. We use this approach with Lake Formation enabled for all tables, and everything works as expected. The only issue with Redshift is that it is very sensitive to 'broken' partitions: Athena can skip partitions that do not point to 'real' data, but Redshift fails on them.
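A minimal sketch of the Redshift side of this approach, assuming a Glue database named spectrum_db (a hypothetical name) whose tables are already defined on top of the manifest files:

```sql
-- Map a Glue Data Catalog database into Redshift as an external schema;
-- Spectrum then resolves tables (and their manifest-based locations)
-- through the catalog rather than through a per-table location.
create external schema ruben_spectrum
from data catalog
database 'spectrum_db'
iam_role 'arn:aws:iam::xxxxxxx:role/mySpectrumRole';

select count(*) from ruben_spectrum.ruben_manifest_test;
```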