hivehive-serde

character slash is not being read by hive on using OpenCSVSerde


I have defined a table on top of files present in hdfs. I am using the OpenCSV Serde to read from the file. But, '\' slash characters in the data are getting omitted in the final result set.

Is there a hive serde property that I am not using correctly. As per the documentation, escapeChar = '\' should fix this problem. But, the problem persists.

   CREATE EXTERNAL TABLE `tsr`(
    `last_update_user` string COMMENT 'from deserializer',
    `last_update_datetime` string COMMENT 'from deserializer')
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
    'escapeChar'='\',
    'quoteChar'='\"',
    'separatorChar'=',',
    'serialization.encoding'='UTF-8')
    STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
    'hdfs://edl/hive/db/tsr'
    TBLPROPERTIES (
    'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
    'numFiles'='1',
    'numRows'='1869',
    'rawDataSize'='0',
    'serialization.null.format'='',
    'totalSize'='144640',
    'transient_lastDdlTime'='1524479930')

Sample Output:

DomainUser1 , 2017-07-04 19:07:27

Expected Result:

Domain\User1 , 2017-07-04 19:07:27

EDIT 1: I have tried both '\\' and '\' as the escapeChar and both have the same problem


Solution

  • Unfortunately the csv serde in Hive does not support multiple characters as separator/quote/escape, it looks like you want to use 2 backlslahes as escapeChar (which is not possible) consideirng than OpenCSVSerde only support a single character as escape (actually it is using CSVReader which only supports one). I am not aware about any other SerDe that supports multiple characters in Hive, you can always implement your own udf with other library, not the most popular option (nobody wants to support its own stuffs :)). I would recommend use a different character as escape, hopefully one not present in your data. A second option would be modify your data during your ingestion to replace \ by \\