regexhivebigdataregex-grouphive-serde

Unable to Parse string using Hive Regex Serde


I am trying to parse a string which is :

"297","298","Y","","299"

using Regexp serder but i am unable to do so. The Table definition i have created is :

create external table test.test1
(a string,
b string,
c string,
d string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "\"\"|\"([^\"]+)\"")

the regex used in the serde properties looks promising in the regexp test websites but i am getting exception while trying to read the table kindly help me out in this. I know that this can be easily done using csv serde but i am trying to figure out a bigger part of the problem for which i have to use the regexp serde Thanks


Solution

  • In the regex it should be capturing group per column. Your data contains 5 columns and table 4, you want to skip one column, right?

    For example this regex will work: with serdeproperties ('input.regex' = '^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$')

    You can easily check without creating table, like this:

    select regexp_replace('"297","298","Y","","299"','^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$','$1|$2|$3|$4');
    OK
    _c0
    297|298|Y|299
    
    select regexp_replace('"297","298","Y","this column is skipped","299"','^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$','$1|$2|$3|$4');
    OK
    _c0
    297|298|Y|299