csv, hive, hive-serde, hiveddl, lazysimpleserde

What format applies to the Hive LazySimpleSerDe?


What exactly is the format for Hive LazySimpleSerDe? A name like ParquetHiveSerDe tells me that Hive will read the HDFS files in Parquet format.

But what is LazySimpleSerDe? Why not call it something explicit like CommaSepHiveSerDe or TabSepHiveSerDe, given LazySimpleSerDe is for delimited files?


Solution

  • LazySimpleSerDe is a fast, simple SerDe for delimited text. It does not recognize quoted values, but it works with any delimiter, not only commas, which is why its name does not mention one; the default field delimiter is Ctrl-A (\001), not TAB or comma. If you specify STORED AS TEXTFILE in the table DDL without naming a SerDe, LazySimpleSerDe is used. For quoted values use OpenCSVSerde: it is slower than LazySimpleSerDe but handles quoted values correctly.

    LazySimpleSerDe is kept simple for the sake of performance. It also creates objects lazily, deserializing a field only when it is accessed, which is why it is preferable for text files whenever possible.

    See this example with pipe-delimited (|) file format: https://stackoverflow.com/a/68095278/2700344

    For such a table, the SHOW CREATE TABLE command prints the SerDe class as org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe; STORED AS TEXTFILE is a shortcut that selects it by default.
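
The points above can be sketched in HiveQL. This is a minimal illustration, not a definitive setup: the table and column names (demo_pipe, demo_pipe_explicit, demo_csv, id, name) are hypothetical. The first two statements should produce equivalent tables, one using the ROW FORMAT DELIMITED shorthand and one naming LazySimpleSerDe explicitly; the third shows the quote-aware alternative.

```sql
-- Hypothetical table and column names, for illustration only.

-- Shorthand: a delimited text table; Hive uses LazySimpleSerDe implicitly.
CREATE TABLE demo_pipe (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- The same layout with the SerDe class named explicitly.
CREATE TABLE demo_pipe_explicit (
  id   INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = '|')
STORED AS TEXTFILE;

-- Quote-aware alternative for real CSV data with quoted fields.
-- OpenCSVSerde exposes every column as STRING, so cast in queries as needed.
CREATE TABLE demo_csv (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
STORED AS TEXTFILE;
```

Running SHOW CREATE TABLE demo_pipe; afterwards is a quick way to confirm which SerDe class Hive actually recorded for the shorthand form.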