group-byapache-pigdistinct

In Apache Pig, select DISTINCT rows based on a single column


Let's say I have a table such as the one below, that may or may not contain duplicates for a given field:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
002    http://example.com/beth?extra=blah
003    http://example.com/charlie

I would like to write a Pig script to find only DISTINCT rows, based on the value of a single field. For instance, filtering the table above by ID should return something like the following:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
003    http://example.com/charlie

The Pig GROUP BY operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).

The Pig DISTINCT operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.

For my purposes, I do not care which of the rows with ID 002 are returned.


Solution

  • I found one way to do this, using the GROUP BY and the TOP operators:

    my_table = LOAD 'my_table_file' AS (A, B);
    
    my_table_grouped = GROUP my_table BY A;
    
    my_table_distinct = FOREACH my_table_grouped {
    
        -- For each group $0 refers to the group name, (A)
        -- and $1 refers to a bag of entire rows {(A, B), (A, B), ...}.
        -- Here, we take only the first (top 1) row in the bag:
    
        result = TOP(1, 0, $1);
        GENERATE FLATTEN(result);
    
    }
    
    DUMP my_table_distinct;
    

    This results in one distinct row per ID column:

    (001,http://example.com/adam)
    (002,http://example.com/beth?extra=blah)
    (003,http://example.com/charlie)
    

    I don't know if there is a better approach, but this works for me. I hope this helps others starting out with Pig.

    (Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)