amazon-web-servicesaws-glueaws-glue-data-catalog

Setting the number of decimal places when updating Glue Table Schema


I'm trying to update a CSV table definition that has been created by the Glue Data Crawler. One of the columns contains decimal data that is currently being classified as double precision.

I'm finding that when I use the console to change the schema, I can't set any additional properties that might be associated with the data types (for example, if I select Decimal I get Decimal(10,0) with no way to change the size of the number or the number of decimal places).

What's the recommended way to update this schema so that it has the correct data-type, including additional properties?


Solution

  • I had some problems setting a decimal on a Glue Table Schema recently. I had to create my schema via the AWS cli.

    What I had was a little different, it was a parquet on my s3 datalake.

    The following cli command creates the schema based on a json:

    aws glue create-table --database-name example_db --table-input file://example.json
    

    The following example.json references a parquet files on s3://my-datalake/example/{dt}/ where dt is a partition of my table. And dec_col is a column with decimal(10,2) type:

    {
        "Name": "example",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
              {
                "Name": "id",
                "Type": "int"
            },
            {
                "Name": "dec_col",
                "Type": "decimal(10,2)"
            }
            ],
            "Location": "s3://my-datalake/example/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "Compressed": false,
            "NumberOfBuckets": 0,
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                "Parameters": {
                    "serialization.format": "1"
                }
            },
            "SortColumns": [],
            "StoredAsSubDirectories": false
        },
        "PartitionKeys": [
            {
                "Name": "dt",
                "Type": "date"
            }
        ],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "classification": "parquet"
        }
    }
    

    This way you can define the type as decimal with scale and precision, which is what you're looking for.