I have following data set.
f1,f2,f3,f4,f5,f6
I am looking for count of f6 along with rest of the fields.
f1,f2,f3,f4,f5,5
f1,f2,f3,f4,f5,3
and so on.
I tried this code but it takes too long to execute
A = LOAD 'file a'
B = GROUP A BY f6
C = FOREACH B GENERATE FLATTEN (group) as f6, FLATTEN(f1), FLATTEN(f2),FLATTEN(f3),FLATTEN(f4),FLATTEN(f5),COUNT(f6)
Is there any better way to achieve what I am looking for ?
If I simply try to get count without flatten then fields end up in bag but I want final output as tuple.
So trying this gives me output as bag
C = FOREACH B GENERATE FLATTEN (group) as f6, A.f1,A.f2.A.f3,A.f4,A.f5, COUNT(f6)
All inputs are appreciated.
Cheers
It is also possible to flatten the projection which was grouped.
A = LOAD 'file a';
B = GROUP A BY f6;
C = FOREACH B GENERATE FLATTEN(A), COUNT(A) as f6_count;
EDIT 1: The key is using FLATTEN(A) instead of FLATTEN(group).
FLATTEN(A) will produce a tuple with all of the columns from the original relation and will get rid of the bag, even those that were not used in the group by statement (f1, f2, f3, f4, f5, f6).
FLATTEN(group) will return only columns used in the group by, in this case f6. Advantage of this approach is that it is very efficient and requires a single Map Reduce job to execute. Any solution that involves JOIN operation adds extra Map Reduce job.
As a general rule of thumb in pig, hive and MR, group by and join operations usually are executed as separate MR Jobs and reducing the number of MR jobs leads to improved performance.