hadoopapache-pig

Use REPLACE in multicondition RegExp in Pig Latin


I have this set of data:

dump data;

This is a sample output: (this dataset is almost a million rows long).

("0",60,0,1,"Fri")
("1",47,0,1,"Mon")
("1",23,1,0,"Tue")
("1",60,0,0,"Sat")
("1",50,1,1,"Fri")

I want to replace the values: Sat, Fri, Mon to numbers of week, I know how to use REPLACE for change just 1 value at a time, but I have to repeat it multiple times in order to change all days of the week:

data_day_of_week = FOREACH data GENERATE $0,$1,$2,$3,REPLACE($4, 'Mon', '1');

Is there any way to do this in only one statement?


Solution

  • saph_top, was the closer one to answer my question, nonetheless after testing it was resolving blank output, I'm going to complement his answer:

    'Mon' is not the same as "Mon", therefore when I was using: CASE WHEN $4 == 'Mon' THEN '1' It wasn´t replacing anything, resulting in blank result in: data_day_of_week.

    To solve this, I just add " " (double quotes to the condition):

    data_day_of_week = FOREACH data GENERATE
    CASE
         WHEN $4 == '"Mon"' THEN '1'
         WHEN $4 == '"Tue"' THEN '2'
         ...
         WHEN $4 == '"Sun"' THEN '7'
    END AS day_number;
    

    After that in order to rebuild the data I add the following to GENERATE Clause:

    data_day_of_week = FOREACH data GENERATE $0,$1,$2,$3,
        CASE
             WHEN $4 == '"Mon"' THEN '1'
             WHEN $4 == '"Tue"' THEN '2'
             ...
             WHEN $4 == '"Sun"' THEN '7'
        END AS day_number;
    

    And the output was complete now: dump data_day_of_week;

    ("0",60,0,1,5)
    ("1",47,0,1,1)
    ("1",23,1,0,2)
    ("1",60,0,0,6)
    ("1",50,1,1,5)