Let's say I have a table such as the one below, that may or may not contain duplicates for a given field:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
002 http://example.com/beth?extra=blah
003 http://example.com/charlie
I would like to write a Pig script to find only DISTINCT rows, based on the value of a single field. For instance, filtering the table above by ID
should return something like the following:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
003 http://example.com/charlie
The Pig GROUP BY
operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).
The Pig DISTINCT
operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.
For my purposes, I do not care which of the rows with ID 002
are returned.
I found one way to do this, using the GROUP BY
and the TOP
operators:
my_table = LOAD 'my_table_file' AS (A, B);
my_table_grouped = GROUP my_table BY A;
my_table_distinct = FOREACH my_table_grouped {
-- For each group $0 refers to the group name, (A)
-- and $1 refers to a bag of entire rows {(A, B), (A, B), ...}.
-- Here, we take only the first (top 1) row in the bag:
result = TOP(1, 0, $1);
GENERATE FLATTEN(result);
}
DUMP my_table_distinct;
This results in one distinct row per ID column:
(001,http://example.com/adam)
(002,http://example.com/beth?extra=blah)
(003,http://example.com/charlie)
I don't know if there is a better approach, but this works for me. I hope this helps others starting out with Pig.
(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)