amazon-web-services, aws-glue, parquet, amazon-athena, snappy

Easiest way to remap column headers in Glue/Athena?


Data has headers like _col_0, _col_1, etc. I have a sample data file that has the correct column headers.

However, all the data is in snappy/parquet across ~250 files.

What is the easiest way to remap the column headers in Glue?

Thanks.

UPDATE:

So I tried John R's comment below: I went in and edited the table schema in Glue to rename the columns, but when I query the data, any column I edited is now missing its data.

I re-ran the Glue job and it overwrote the edited schema (which makes sense), but the data is back.

So editing a column name in the schema causes that column's data to be dropped, or not mapped to the renamed column. I searched Google but couldn't find any related issues.

UPDATE #2:

If I rename the column back to its original name (the one stored in the snappy/parquet files), then the data comes back.

UPDATE #3:

I basically solved this by creating a view in Athena that renames the _col_0... columns to their correct names (a sketch of the idea is below).
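
Roughly, the view looks like the following. This is only a sketch: the table name (my_table) and the "real" column names (order_id, order_date) are placeholders, not my actual schema.

    -- Sketch of the workaround: expose the generic Parquet column names under
    -- readable aliases. my_table, order_id, and order_date are placeholders.
    CREATE OR REPLACE VIEW my_table_named AS
    SELECT
        _col_0 AS order_id,
        _col_1 AS order_date
    FROM my_table;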


Solution

  • I would say that you can't rename columns on tables stored as Parquet, because the schema is contained in the files themselves. If you add a file with a different header, then Glue will treat it in one of three ways when running crawlers: ignore the change, add new columns, or create new tables, depending on how the crawler is set up.

    What you can do is recreate all the files in another bucket with the desired column names and create another table on top of them, then delete the old files if you want.

    Although this might sound a little difficult, it is not. You can leverage Athena's UNLOAD command (also available in Redshift) to do so. This is also a good moment to (re)partition your files if you see fit. I will leave an example here; you can then customize the command for your own needs. Check the docs for further reference: UNLOAD - Amazon Athena

    -- Columns listed in partitioned_by must come last in the SELECT list.
    UNLOAD (SELECT _col_0 AS new_col0_name, _col_1 AS new_col1_name FROM your_table)
    TO 's3://yourbucket/your_partitioned_table/'
    WITH (format = 'PARQUET', compression = 'SNAPPY', partitioned_by = ARRAY['new_col1_name'])
    
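
    UNLOAD only writes the files; it does not register a table for them. One way to finish the swap is sketched below, assuming the same placeholder bucket path and column names as the example above (the table name and data types are made up): create an external table over the new prefix, then load its partitions.

    -- Sketch only: table name, column types, and location are placeholders.
    CREATE EXTERNAL TABLE renamed_table (
        new_col0_name string
    )
    PARTITIONED BY (new_col1_name string)
    STORED AS PARQUET
    LOCATION 's3://yourbucket/your_partitioned_table/';

    -- Pick up the partition folders written by UNLOAD.
    MSCK REPAIR TABLE renamed_table;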

    Redshift UNLOAD is more robust, so if you need more options, like controlling the maximum file size, you can do it by leveraging Redshift Spectrum to query the data directly from S3.
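
    For example, a rough sketch of that route (the Spectrum external schema, the IAM role ARN, and the 256 MB limit are all placeholder assumptions):

    -- Sketch only: read the S3 data through a Redshift Spectrum external schema
    -- and rewrite it with new column names; spectrum_schema, the IAM role, and
    -- the MAXFILESIZE value are placeholders.
    UNLOAD ('SELECT _col_0 AS new_col0_name, _col_1 AS new_col1_name
             FROM spectrum_schema.your_table')
    TO 's3://yourbucket/your_partitioned_table/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/your-unload-role'
    FORMAT AS PARQUET
    PARTITION BY (new_col1_name)
    MAXFILESIZE 256 MB;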