sqldatabasehiveclickstream

HIVE/SQL: Querying additive clickstream data


In the list functionality of our product, we have data storing the list of IDs that are showing up in a list. In a small set of our data (which has now been fixed), the IDs included in a list of results are stored in an additive manner. See the screenshot below to understand more of what I am trying to communicate.

Datetime            User              Action           Extra
5-20-2015 9:00AM     A                 list           ids: 1
5-20-2015 9:00AM     A                 list           ids: 1, 2   
5-20-2015 9:00AM     A                 list           ids: 1, 2, 3
5-20-2015 9:01AM     B                 list           ids: 5
5-20-2015 9:01AM     B                 list           ids: 5, 6
5-20-2015 9:01AM     B                 list           ids: 5, 6, 7

Is it possible to write a HIVE or SQL query against this type of data organization? I tried to query LENGTH(extra) in a subquery and then pull in the MAX(LENGTH(extra)) data, but we have been unable to get it to work.

What is the best way to work with this data so we are only pulling the FINAL list of IDs?


Solution

  • The row_number() window function may be what you're after (I often use it for queries against additive data sources where I'm ordering by date):

    select * from
    (select *, 
     row_number() over (partition by user order by length(extra) desc) rn
     from yourtable
     ) a
     where rn = 1