I'm storing a Hive table externally, and it's a pretty simple data structure. The table is created in Hive as
(user string, names array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\001'
STORED AS TEXTFILE
(I've tried other delimiters, too).
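For completeness, the full statement I'm using looks roughly like this (the table name and LOCATION here are placeholders, not the real values):

CREATE EXTERNAL TABLE users_names (user string, names array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION '/path/to/table/dir';

So each line of the underlying text file looks like user<TAB>name1\001name2\001name3.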
In Pig, I can't seem to figure out the right way to use a bag or tuple to just load a simple array! Here's what I've tried without luck:
users = load '<file>' using PigStorage() AS (user:chararray, names:bag{tuple(name:chararray)})
users = load '<file>' using PigStorage() AS (user:chararray, names:chararray)
and some other things, but the best I've managed is to get the names loaded as a single string with the delimiter stripped out (which doesn't help). How do I just load a variable-length array of strings?
Thanks
Let's say you have the following data (fields separated by a tab) in the file /user/hdfs/tester/ip/test on HDFS:
cat test:
1 A,B
2 C,D,E,F
3 G
4 H,I,J,K,L,M
In Pig (MapReduce mode), do the following:
-- Load the tab-separated file: the first field is the id, the second is the raw comma-separated names string
a = LOAD '/user/hdfs/tester/ip/test' USING PigStorage('\t') AS (id:int, names:chararray);
-- STRSPLIT turns the comma-separated string into a tuple, TOBAG wraps that tuple in a bag, and FLATTEN un-nests it again
b = FOREACH a GENERATE id, FLATTEN(TOBAG(STRSPLIT(names, ','))) AS value:tuple(name:chararray);
The first column is the id and value is a tuple of chararrays holding the split names.
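If you specifically want a bag of single-field tuples (the names:bag{tuple(name:chararray)} schema from the question) rather than a tuple, one alternative sketch is to use the built-in TOKENIZE, which splits a chararray on the given delimiter characters and returns a bag. This is a suggestion of mine rather than part of the original answer, and the relation name c and field name names_bag are just illustrative:

-- Reuses relation 'a' from above; TOKENIZE(names, ',') yields a bag of one-field tuples, e.g. {(A),(B)}
c = FOREACH a GENERATE id, TOKENIZE(names, ',') AS names_bag;
DUMP c;
-- The first row should come out roughly as (1,{(A),(B)})

Note that the two-argument form of TOKENIZE (with an explicit delimiter) is only available in newer Pig versions, so check your version before relying on it.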