regexhivecreate-tablehive-serderegexserde

hive create table input.regex - filter out all rows starting with a char


I want to create table in Hive

CREATE TABLE table (
    a     string
   ,b     string
) 
PARTITIONED BY ( pr_filename string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='reg_exp') ;

but source data have multiline header starting with "#"

# <some comment>
#
# <some other comments>
# <some other comments>
# <some other comments>
#
a,b
1,2
8,2
8,9

Is it possible to write reg_exp to filter out all rows starting with chosen character or do I have to use temporary table to deal with this header?


Solution

  • If you try to filter like this:

    'input.regex'='^([^#]+),([a-zA-Z])' --first group is everything except #
    

    The row will be returned anyway with NULLs, you can filter such records.

    RegexSerDe JavaDocs says: In deserialization stage, if a row does not match the regex, then all columns in the row will be NULL. If a row matches the regex but has less than expected groups, the missing groups will be NULL. If a row matches the regex but has more than expected groups, the additional groups are just ignored

    The solution is to use intermediate table + filter rows when selecting from it.