I have two columns id and segment. Segment is comma separated set of strings. I need to find average number of segments in all the table. One way to do it is by using two separate queries -
A - select count(*) from table_name;
B - select count(*) from table_name LATERAL VIEW explode(split(segment, ',') lTable AS singleSegment where segment != ""
avg = B/A
Answer would be 8/4 = 2 in the above case.
Is there a better way to achieve this ?
Try:
select sum(CASE segment
WHEN '' THEN 0
ELSE size(split(segment,','))
END
)*1.0/count(*) from table_name;
If your id field is unique, and you want to add a filter to the segment part, or protect against other malformed segment
values like a,b,
and a,,b
, you could do:
SELECT SUM(seg_size)*1.0/count(*) FROM (
SELECT count(*) as seg_size from table_name
LATERAL VIEW explode(split(segment, ',')) lTable AS singleSegment
WHERE trim(singleSegment) != ""
GROUP BY id
) sizes
Then you can add other stuff into the where clause.
But this query takes two Hive jobs to run, compared to one for the simpler query, and requires the id field to be unique.