hive  sql-insert  parquet  amazon-athena

Athena INSERT and Hive format error for HiveIgnoreKeyTextOutputFormat


Before the question/issue, here's the setup:

Table 1

CREATE EXTERNAL TABLE `table1`(
  `mac_address` string, 
  `node` string, 
  `wave_found` string, 
  `wave_data` string, 
  `calc_dt` string, 
  `load_dt` string)
PARTITIONED BY ( 
  `site_id` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://foobucket/object-thing'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'transient_lastDdlTime'='1654609315')

Table 2

CREATE EXTERNAL TABLE `table2`(
  `mac_address` string, 
  `node` string, 
  `wave_found` string, 
  `wave_data` string, 
  `calc_dt` string)
PARTITIONED BY ( 
  `load_dt` string, 
  `site_id` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://foobucket/object-thing'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'transient_lastDdlTime'='1654147830')

When the following Athena SQL is executed, the error below is thrown:

insert into table2
select * from table1;

"HIVE_UNSUPPORTED_FORMAT: Output format org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat with SerDe org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe is not supported."

That error seems relatively straightforward, but I'm still stuck on building a solution despite looking for alternatives to the so-called HiveIgnoreKeyTextOutputFormat. There's also the difference in partitioning between the two tables, but I'm not sure whether that has any bearing on the error shown here.

Here are some sources I've found and used so far: 1, 2


Solution

  • I believe you should be using the following SerDe/format combination:

    ROW FORMAT SERDE 
        'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
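
    For reference, a manually created table2 with that combination could look roughly like this (a sketch reusing the columns and partitioning from the question; the S3 location below is an assumption and must not overlap with table1's path). STORED AS PARQUET is an equivalent shorthand for the same SerDe/format trio:

    CREATE EXTERNAL TABLE `table2`(
      `mac_address` string, 
      `node` string, 
      `wave_found` string, 
      `wave_data` string, 
      `calc_dt` string)
    PARTITIONED BY ( 
      `load_dt` string, 
      `site_id` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://foobucket/object-thing-table2/'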
    

    You can also create table2 via CTAS, which avoids the manual CREATE TABLE + INSERT.
    Note that table1 and table2 need separate, non-overlapping S3 locations (in the question both point to s3://foobucket/object-thing):

    create table table2
    with
    (
        format='parquet',
        parquet_compression='snappy',
        partitioned_by=array['load_dt', 'site_id'],
        external_location = 's3://foobucket/object-thing-table2/'
    )
    as
    select * from table1;
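
    Once table2 uses the Parquet output format, the INSERT from the question should work as well. A sketch, assuming table2 keeps the layout above (Athena expects the partition columns to come last in the SELECT list, in the order they are declared in PARTITIONED BY):

    insert into table2
    select
        mac_address,
        node,
        wave_found,
        wave_data,
        calc_dt,
        load_dt,   -- partition column
        site_id    -- partition column, declared last in PARTITIONED BY
    from table1;

    Since table1's column list already ends with load_dt followed by its partition column site_id, a plain select * from table1 lines up the same way here.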
    

    If you now run show create table table2;, you will see that CTAS has generated ParquetHiveSerDe together with MapredParquetInputFormat and MapredParquetOutputFormat.
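
    For example (abridged; the exact TBLPROPERTIES and location in your output will differ):

    show create table table2;

    -- the generated DDL should contain something along these lines:
    -- ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    -- STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    -- OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'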