I'm working in the Grunt shell while using Pig. I have a table A with a column colA. I want to group A by colA and store the result in a file grACount, then filter grACount and store the filtered results in a file called grACountFilter.
If I write statements like the following in the grunt shell:
grA = GROUP A BY colA;
grACount = FOREACH grA GENERATE group AS colA, COUNT(A.colA) AS countColA;
STORE grACount INTO 'grACount';
grACountFilter = FILTER grACount BY countColA > 15;
STORE grACountFilter INTO 'grACountFilter';
Then it will submit a MapReduce job for the first STORE (line 3) and then again for the second STORE (line 5), right?
And when it submits the job again for line 5, it will recompute the tables, right?
What I want is to avoid submitting two different MapReduce jobs and instead have all the computations performed in one go. Is this possible?
The STORE and DUMP commands in Pig trigger job execution, and you can't block that behavior: in the interactive Grunt shell, each STORE or DUMP submits work as soon as it is entered. What you can do is keep all the STORE commands together and run them as a batch (for example, in a script file), which lets Pig's multi-query execution feature combine them into a shared execution plan so common work like the GROUP is not recomputed.
Note that even then, a single STORE or DUMP command may trigger multiple MapReduce jobs. The execution plan is built and executed when the script reaches a STORE or DUMP command, and the number of jobs depends on that plan.
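For example, you could put the whole pipeline in a script file (say grACount.pig, a name chosen here for illustration) with both STORE statements, and run it in batch mode with `pig grACount.pig`, or from inside Grunt with `exec grACount.pig`, so that Pig plans both stores together. A sketch, assuming an input path 'input' and a single-column schema for A that aren't given in the question:

-- grACount.pig: run in batch mode so multi-query execution
-- (on by default) can share the GROUP between both stores
A = LOAD 'input' AS (colA:chararray);  -- assumed path and schema
grA = GROUP A BY colA;
grACount = FOREACH grA GENERATE group AS colA, COUNT(A.colA) AS countColA;
grACountFilter = FILTER grACount BY countColA > 15;
STORE grACount INTO 'grACount';
STORE grACountFilter INTO 'grACountFilter';

Because the script is executed as a whole rather than line by line, Pig can merge the two STOREs into one plan instead of rerunning the grouping for the filter.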