hiveaggregate-functionsexplodehiveqlapache-hive

Average count of elements in a set in hive?


enter image description here

I have two columns id and segment. Segment is comma separated set of strings. I need to find average number of segments in all the table. One way to do it is by using two separate queries -

A - select count(*) from table_name;
B - select count(*) from table_name LATERAL VIEW explode(split(segment, ',') lTable AS singleSegment where segment != ""
avg = B/A

Answer would be 8/4 = 2 in the above case.

Is there a better way to achieve this ?


Solution

  • Try:

    select sum(CASE segment 
               WHEN '' THEN 0 
               ELSE  size(split(segment,','))
               END
               )*1.0/count(*) from table_name;
    

    If your id field is unique, and you want to add a filter to the segment part, or protect against other malformed segment values like a,b, and a,,b, you could do:

    SELECT SUM(seg_size)*1.0/count(*) FROM (
        SELECT count(*) as seg_size from table_name
        LATERAL VIEW explode(split(segment, ',')) lTable AS singleSegment
        WHERE trim(singleSegment) != ""
        GROUP BY id
    ) sizes
    

    Then you can add other stuff into the where clause.

    But this query takes two Hive jobs to run, compared to one for the simpler query, and requires the id field to be unique.