hadoopapache-pig

Sum of Salary in Apache Pig


Create employee and dept tables for the available files emp1.csv and dept.csv.

Colnames: Emp: Empno, name, sal, did, branch, dno Dept: deptno, name, loc

Retrieve total salaries to be paid for the employees working in ‘chicago’.

the table for emp was like

1010,jack,45000,CSE,10
1011,nick,70000,ECE,20
1012,mike,60000,ECE,30
1013,james,25000,CSE,20

and dept table was

10,ACCOUNTING,DALLAS
20,OPERATIONS,CHICAGO
30,SALES,BOSTON

I've joined both tables

grunt> emp_data = load ‘student/emp1.csv’ using PigStorage(‘,’) as (empno: int, empname: 
       chararray, sal: int, did: chararray, branch: chararray, dno: int);

grunt> emp_dept = load ‘student/dept.csv’ using PigStorage(‘,’) as (deptno: int, name: 
       chararray, loc: chararray);
grunt> joined = join emp_data by dno, emp_dept by deptno;
grunt> emp_loc = joined by loc matches 'CHICAGO';
grunt> total_sal = foreach emp_loc generate sum(sal);

After the last line it shows an error

EROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve sum using import: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

The answer should be 95000


Solution

  • First, you need to remove did: chararray from the emp_data since that doesn't seem to be part of your data.

    Regarding the error, capitalization matters for built-in functions. Best to always capitalize all Pig keywords and functions, I've found.

    For the "explicit cast" error ... SUM (and other aggregate functions) takes a bag, and you are only passing an int, thus "none of them fit" (the method signatures of the SUM function).
    To get a bag, you need to GROUP. You can also improve the JOIN performance by pre-filtering the data. After the join and group, you need to provide the full identifier of the sal (seen from describe X)

    From docs

    SUM

    Computes the sum of the numeric values in a single-column bag. SUM requires a preceding GROUP ALL statement for global sums and a GROUP BY statement for group sums

    Same logic applies if you were to use SQL...

    Example

    E = LOAD 'emp.csv' USING PigStorage(',') AS (empno: int, empname: chararray, sal: int, branch: chararray, dno: int);
    D = LOAD 'dept.csv' USING PigStorage(',') AS (deptno: int, name: chararray, loc: chararray);
    D_CHI = FILTER D BY loc == 'CHICAGO';
    
    X = JOIN E BY dno, D_CHI BY deptno;
    
    X_BY_LOC = GROUP X BY D_CHI.loc;
    O = FOREACH X_BY_LOC GENERATE group, SUM(X.E::sal) as total;
    DUMP O;
    

    Yes, it is strange to filter then group by when you only have one entry after the filter, but it makes sense if you had more chicago values...

    Output

    (CHICAGO,95000)