I want to create table in Hive
CREATE TABLE table (
a string
,b string
)
PARTITIONED BY ( pr_filename string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='reg_exp') ;
but source data have multiline header starting with "#"
# <some comment>
#
# <some other comments>
# <some other comments>
# <some other comments>
#
a,b
1,2
8,2
8,9
Is it possible to write reg_exp to filter out all rows starting with chosen character or do I have to use temporary table to deal with this header?
If you try to filter like this:
'input.regex'='^([^#]+),([a-zA-Z])' --first group is everything except #
The row will be returned anyway with NULLs, you can filter such records.
RegexSerDe JavaDocs says: In deserialization stage, if a row does not match the regex, then all columns in the row will be NULL. If a row matches the regex but has less than expected groups, the missing groups will be NULL. If a row matches the regex but has more than expected groups, the additional groups are just ignored
The solution is to use intermediate table + filter rows when selecting from it.