amazon-web-services aws-glue parquet amazon-kinesis-firehose

How to define AWS Glue table structure with embedded structs


To convert records from JSON to Parquet in Kinesis Firehose, you have to define the table structure in AWS Glue. For whatever reason, Glue uses its own format: top-level fields can be defined in JSON, but embedded types have to be defined in a custom syntax.

The example in the AWS console suggests something like:

STRUCT <
  employer: STRING,
  id: BIGINT,
  address: STRING
>

But when I create the table and launch the Kinesis process, I get this error:

The schema is invalid. Error parsing the schema: Error: type expected at the position 16 of 'struct<software: string, session_id: string>' but ' string' is found.

Any suggestions on how to make it work? (I tried asking ChatGPT, but got a very similar answer that leads to the same error.)


Solution

  • So, after some back and forth, it seems the issue really is the space between the colon and the data type. The correct example should be

    STRUCT <
      employer:STRING,
      id:BIGINT,
      address:STRING>
    
    

    Update: the other spaces are harmful too, so in the JSON view the type should be written with no spaces at all

    {
        "Name": "metadata",
        "Type": "struct<field1:string,field2:string,field3:string,field4:string,field5:string>",
        "Comment": ""
    }
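
If you would rather keep the struct definition readable in your own code, one option is to strip the whitespace just before the type string goes into the column definition. The following is only a minimal sketch: `compact_glue_type` and `glue_column` are hypothetical helper names, not part of any AWS SDK, and it assumes none of your field names contain whitespace.

```python
import json
import re


def compact_glue_type(type_str: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines) from a type string.

    Assumption: field names never contain whitespace themselves.
    """
    return re.sub(r"\s+", "", type_str)


def glue_column(name: str, type_str: str, comment: str = "") -> dict:
    """Build a column entry shaped like the JSON view shown above (hypothetical helper)."""
    return {"Name": name, "Type": compact_glue_type(type_str), "Comment": comment}


# Human-friendly layout, as suggested by the console example in the question.
readable_type = """
STRUCT <
  employer: STRING,
  id: BIGINT,
  address: STRING
>
"""

column = glue_column("metadata", readable_type)
print(json.dumps(column, indent=2))
```

The printed `Type` value contains no spaces or line breaks, which is the form that stopped the parse error in the case described above.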