amazon-web-servicesamazon-iamaws-glueamazon-athenaaws-glue-data-catalog

Accessing datacatalog table in Glue properly


I created a table in Athena without a crawler from S3 source. It is showing up in my datacatalog. However, when I try to access it through a python job in Glue ETL, it shows that it has no column or any data. The following error pops up when accessing a column: AttributeError: 'DataFrame' object has no attribute '<COLUMN-NAME>'.

I am trying to access the dynamic frame following the glue way:

datasource = glueContext.create_dynamic_frame.from_catalog(
  database="datacatalog_database",
  table_name="table_name",
  transformation_ctx="datasource"
)

print(f"Count: {datasource.count()}")
print(f"Schema: {datasource.schema()}")

The above logs output: Count: 0 & Schema: StructType([], {}), where the Athena table shows I have around ~800,000 rows.

Sidenotes:


Solution

  • It looks like the S3 bucket has multiple nested folders inside it. For Glue to read these folders you need to add a flag adding additional_options = {"recurse": True} to your from_catalog(). This will help to recursively read records from s3 files.