csv ubuntu apache-pig

Counting the number of occurrences of unique values using Pig Latin


I am trying to find the top 5 most downloaded packages on December 1, 2019, from the RStudio CRAN mirror logs (http://cran-logs.rstudio.com/), using Apache Pig Latin. The columns I need are 'r_os' and 'package'. Here is my code:

A = load '2019-12-01.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
B = FOREACH A GENERATE r_os,package;
C = DISTINCT B;
D = GROUP C BY package;
result = FOREACH C GENERATE flatten($0), COUNT($1) as package_distr;

I'm getting the following result, which is wrong:

(magrittr,10)
(htmltools,10)
(httr,10)
(lubridate,10)
(ellipsis,10)

The counts should be much higher than 10. My desired output should look approximately like this:

(magrittr,10000)
(htmltools,9876)
(httr,8700)
(lubridate,5320)
(ellipsis,3000)

Any idea what I'm doing wrong?


Solution

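Your last statement iterates over C instead of the grouped relation D. Try this:

result = FOREACH D GENERATE group, COUNT(C) as package_distr;

Here group is the package name (the grouping key), and C is the name of the bag produced when you grouped C by package, which COUNT then counts.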
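
To get from per-package counts to a top 5, the counts can be sorted and limited. The script below is a minimal sketch, not a tested solution. It assumes that piggybank.jar sits at the path shown (hypothetical) and that the package name is the seventh column ($6) of the log file. The relation names counts, sorted, and top5 are made up for illustration. It also leaves out the DISTINCT step: applying DISTINCT to (r_os, package) pairs before grouping makes COUNT return the number of distinct operating systems per package rather than the number of downloads, which may be why the counts in the question are so small.

```pig
-- Assumption: piggybank.jar is available at this (hypothetical) path.
REGISTER piggybank.jar;

A = LOAD '2019-12-01.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');

-- Assumption: the package name is the 7th column ($6) of the log file.
B = FOREACH A GENERATE (chararray)$6 AS package;

-- One group per package; each group carries a bag named B
-- holding one tuple per download row.
D = GROUP B BY package;

-- Count the tuples in each bag to get downloads per package.
counts = FOREACH D GENERATE group AS package, COUNT(B) AS package_distr;

-- Sort by download count, highest first, and keep the first five.
sorted = ORDER counts BY package_distr DESC;
top5 = LIMIT sorted 5;

DUMP top5;
```

If the LOAD statement declares a schema with an AS clause, the positional reference $6 can be replaced by the column name.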