I'm trying to update a CSV table definition that has been created by the Glue Data Crawler. One of the columns contains decimal data that is currently being classified as double precision.
I'm finding that when I use the console to change the schema, I can't set any additional properties that might be associated with the data types (for example, if I select Decimal I get Decimal(10,0) with no way to change the size of the number or the number of decimal places).
What's the recommended way to update this schema so that it has the correct data-type, including additional properties?
I had some problems setting a decimal on a Glue Table Schema recently. I had to create my schema via the AWS cli.
What I had was a little different, it was a parquet on my s3 datalake.
The following cli command creates the schema based on a json:
aws glue create-table --database-name example_db --table-input file://example.json
The following example.json
references a parquet files on s3://my-datalake/example/{dt}/
where dt
is a partition of my table. And dec_col
is a column with decimal(10,2)
type:
{
"Name": "example",
"Retention": 0,
"StorageDescriptor": {
"Columns": [
{
"Name": "id",
"Type": "int"
},
{
"Name": "dec_col",
"Type": "decimal(10,2)"
}
],
"Location": "s3://my-datalake/example/",
"InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
"OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
"Compressed": false,
"NumberOfBuckets": 0,
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
"Parameters": {
"serialization.format": "1"
}
},
"SortColumns": [],
"StoredAsSubDirectories": false
},
"PartitionKeys": [
{
"Name": "dt",
"Type": "date"
}
],
"TableType": "EXTERNAL_TABLE",
"Parameters": {
"classification": "parquet"
}
}
This way you can define the type as decimal
with scale and precision, which is what you're looking for.