[SOLVED] Average count of elements in a set in hive?

Average count of elements in a set in hive?

I have two columns id and segment. Segment is comma separated set of strings. I need to find average number of segments in all the table. One way to do it is by using two separate queries -

A - select count(*) from table_name;
B - select count(*) from table_name LATERAL VIEW explode(split(segment, ',') lTable AS singleSegment where segment != ""
avg = B/A

Answer would be 8/4 = 2 in the above case.

Is there a better way to achieve this ?

Solution

Try:

select sum(CASE segment 
           WHEN '' THEN 0 
           ELSE  size(split(segment,','))
           END
           )*1.0/count(*) from table_name;

If your id field is unique, and you want to add a filter to the segment part, or protect against other malformed segment values like a,b, and a,,b, you could do:

SELECT SUM(seg_size)*1.0/count(*) FROM (
    SELECT count(*) as seg_size from table_name
    LATERAL VIEW explode(split(segment, ',')) lTable AS singleSegment
    WHERE trim(singleSegment) != ""
    GROUP BY id
) sizes

Then you can add other stuff into the where clause.

But this query takes two Hive jobs to run, compared to one for the simpler query, and requires the id field to be unique.