arrays hadoop apache-pig

Reading array of strings from file with Apache Pig


I'm storing a Hive table externally, and it's a pretty simple data structure. The table is created in Hive as

(user string, names array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\001'
STORED AS TEXTFILE

(I've tried other delimiters, too).

In Pig, I can't seem to figure out the right way to use a bag or tuple to just load a simple array! Here's what I've tried without luck:

users = load '<file>' using PigStorage() AS (user:chararray, names:bag{tuple(name:chararray)})

users = load '<file>' using PigStorage() AS (user:chararray, names:chararray)

and some other things, but the best I've gotten was to have them loaded as a single string with the delimiter removed (which doesn't help). How do I just load a variable-length array of strings?

thanks


Solution

  • Let's say you have the following data in the /user/hdfs/tester/ip/test file on HDFS:

    cat test:
    1   A,B
    2   C,D,E,F
    3   G
    4   H,I,J,K,L,M
    

    In Pig (MapReduce mode), do the following:

    a = LOAD '/user/hdfs/tester/ip/test' USING PigStorage('\t') as (id:INT,names:chararray);
    b = FOREACH a GENERATE id, FLATTEN(TOBAG(STRSPLIT(names,','))) as value:tuple(name:CHARARRAY);
    

    The first column is id, and value is a tuple of chararray names.
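
    If a bag of names per row is what you want (closer to a variable-length array), a minimal sketch using the built-in TOKENIZE function might look like the following. It assumes a Pig version in which TOKENIZE accepts an optional delimiter argument, and the relation names with_bag and one_per_row are made up for illustration. For the Hive-written file, the table's actual collection delimiter would replace ','.

    ```pig
    -- Same tab-separated input as above: an id, then a comma-separated list of names.
    a = LOAD '/user/hdfs/tester/ip/test' USING PigStorage('\t') AS (id:int, names:chararray);

    -- TOKENIZE splits the string on ',' and returns a bag with one single-field tuple per token.
    -- (with_bag is a hypothetical relation name.)
    with_bag = FOREACH a GENERATE id, TOKENIZE(names, ',') AS names;

    -- FLATTEN un-nests that bag, giving one (id, name) row per name.
    -- (one_per_row is a hypothetical relation name.)
    one_per_row = FOREACH a GENERATE id, FLATTEN(TOKENIZE(names, ',')) AS name;
    ```

    Keeping the bag (with_bag) preserves one row per id, while flattening (one_per_row) is handier when you need to filter or join on individual names.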