hiveamazon-athenahive-serderegexserdepresto

Can we create several entries from one line?


My logs look like this: client_id;event_1;event_2;event3

And i would like to get an SQL Table like this:

 client_id |   event
 ---------------------
    ...    |  event_1
    ...    |  event_2
    ...    |  event_3

I am new to Hive, it seems to me that one log line always provides one entry in the resulting SQL table. I tried the following (unsuccessful):

CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
         client_id String,
         `event` String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
         "input.regex" = "^([^\;]+);.*([^\;]+).*$" )
LOCATION 's3://myBucket/prefix/';

It takes only the first event and ignore the others...


Solution

  • Unfortunately, it is not possible to generate rows using SerDe in table DDL. It's possible to do the same in Hive.

    (1) Read all user events as a single column:

    CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
             client_id String,
             events    String
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
     "input.regex" = "^([^\\;]+)\\;(.*)$" )
    LOCATION 's3://myBucket/prefix/';
    

    Check, it should read two columns, user_id and all events concatenated:

    'client_id' and 'event_1;event_2;event3'

    (2) Split events and explode to generate rows:

     select t.user_id, e.event
       from tablename t
            lateral view outer explode(split(t.events,'\\;')) e as event;
    

    Read also about Lateral View.

    In Athena use UNNEST with CROSS JOIN:

    select t.user_id, e.event
           from tablename t
           CROSS JOIN UNNEST(SPLIT(t.events,';')) AS e (event)