amazon-web-servicesamazon-athenapyarrowaws-glue-data-catalog

How does Amazon Athena manage rename of columns?


everyone!

I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3. Those filed will be generated from a PostgreSQL database (RDS). I'll run a query and export data to S3 using Python's Pyarrow.

My question is: since Athena is schema-on-read, add or delete of columns on database will not be a problem...but what will happen when I get a column renamed on database?

Day 1: COLUMNS['col_a', 'col_b', 'col_c']

Day 2: COLUMNS['col_a', 'col_beta', 'col_c']

On Athena,

SELECT col_beta FROM table;

will return only data from Day 2, right? Is there a way that Athena knows about these schema evolution or I would have to run a script to iterate through all my files on S3, rename columns and update table schema on Athena from 'col_a' to 'col_beta'?

Would AWS Glue Data Catalog help in any way to solve this?

I'll love to discuss more about this!


Solution

  • I recommend reading more about handling schema updates with Athena here. Generally Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, using Parquet, columns will be read by name, but you can change that to reading by index as well. Each way has its own advantages / disadvantages dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.

    A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.

    Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".