python hash apache-pig udf

Returning tuple of unknown length from python UDF and then applying hash in Pig


This is a question that has two parts:

First, I have a python UDF that creates a list of strings of unknown length. The input to the UDF is a map (dict in python) and the number of keys is essentially unknown (it is what I'm trying to obtain).

What I don't know is how to output that in a schema that lets me return it as a list (or some other iterable data structure). This is what I have so far:

@outputSchema("?????") #WHAT SHOULD THE SCHEMA BE!?!?
def test_func(input):

    output = []
    for k, v in input.items():

        output.append(str(k))

    return output

Now, the second part of the question. Once in Pig I want to apply a SHA hash to each element in the "list" for all my users. Some Pig pseudo code:

USERS = LOAD 'something' AS (my_map:map[chararray]);
UDF_OUT = FOREACH USERS GENERATE my_udfs.test_func(my_map);
SHA_OUT = FOREACH UDF_OUT GENERATE SHA(UDF_OUT);

The last line is likely wrong as I want to apply the SHA to each element in the list, NOT to the whole list.


Solution

  • To answer your question: since you are returning a Python list whose elements are strings, your decorator should be

    @outputSchema('name_of_bag:{(keys:chararray)}')
    

    Specifying this structure can be confusing, because you only define what a single element of the bag looks like, not how many elements it holds.
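For reference, here is a minimal sketch of what the complete UDF could look like with that kind of schema. The names `map_keys`, `key_bag` and `key` are made up for illustration, and the fallback `outputSchema` definition is only an assumption so the file can also be run outside Pig. Pig's Jython integration maps a Python list to a bag and a Python tuple to a Pig tuple, so each key is wrapped in a one-element tuple to match the schema.

```python
# Sketch of a Jython UDF that returns the keys of a Pig map as a bag.
try:
    outputSchema  # provided by Pig when the script is registered as a Jython UDF
except NameError:
    # Stand-in decorator for running the file outside Pig (local testing only).
    def outputSchema(schema):
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator


@outputSchema('key_bag:{(key:chararray)}')
def map_keys(input_map):
    # A bag is returned as a list of tuples; each tuple holds one chararray field.
    if input_map is None:
        return []
    return [(str(k),) for k in input_map]
```

In Pig this would presumably be registered with something like `REGISTER 'my_udfs.py' USING jython AS my_udfs;` and the resulting bag unpacked with `FLATTEN` before hashing each key.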

    That being said, there is a much simpler way to do what you need. The built-in function KEYSET() (see this question I answered) extracts the keys from a Pig map. Using the data set from that example, with a few more keys added to the first map since you said your maps vary in length:

    maps
    ----
    [a#1,b#2,c#3,d#4,e#5]
    [green#sam,eggs#I,ham#am]
    

    Query:

    REGISTER /path/to/jar/datafu-1.2.0.jar;
    DEFINE SHA datafu.pig.hash.SHA();
    
    A = LOAD 'data' AS (M:[]);
    B = FOREACH A GENERATE FLATTEN(KEYSET(M));
    hashed = FOREACH B GENERATE $0, SHA($0);
    DUMP hashed;
    

    Output:

    (d,18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4)
    (e,3f79bb7b435b05321651daefd374cdc681dc06faa65e374e38337b88ca046dea)
    (b,3e23e8160039594a33894f6564e1b1348bbd7a0088d42c4acb73eeaed59c009d)
    (c,2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6)
    (a,ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb)
    (ham,eccfe263668d171bd19b7d491c3ef5c43559e6d3acf697ef37596181c6fdf4c)
    (eggs,46da674b5b0987431bdb496e4982fadcd400abac99e7a977b43f216a98127721)
    (green,ba4788b226aa8dc2e6dc74248bb9f618cfa8c959e0c26c147be48f6839a0b088)
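To sanity-check digests like these outside Pig, the same per-key hashing can be sketched in plain Python. This assumes the `SHA` call above produces SHA-256, which is consistent with the 64-character hex digests and with the value shown for `a`. The function name `hash_keys` and the sample maps are illustrative only.

```python
import hashlib


def hash_keys(maps):
    """Yield (key, sha256_hex) for every key of every map.

    Mirrors FLATTEN(KEYSET(M)) followed by SHA($0): one output row per key,
    not one row per map.
    """
    for m in maps:
        for key in m:
            digest = hashlib.sha256(key.encode('utf-8')).hexdigest()
            yield key, digest


# Two maps with different numbers of keys, as in the example data.
rows = list(hash_keys([{'a': '1', 'b': '2'}, {'ham': 'am'}]))
for key, digest in rows:
    print(key, digest)
```

Note that `hexdigest()` always returns 64 characters, whereas the `ham` row above shows 63. This sketch does not determine whether that comes from how the digest was formatted or from how the output was copied.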