[SOLVED] Pig output of Multistorage to be appended when I run it everyday

I have a data set on which I ran MultiStorage on the column 'type', and I now have these paths in HDFS: "/output/type1/", "/output/type2/", "/output/type3/", etc.

Now, every day I run a script that uses MultiStorage on the column 'type' to produce "/tmp/type1/", "/tmp/type2/", "/tmp/type3/", etc. (the types here are either a subset of, or the same as, the types already present in the master output).

Since Pig doesn't allow me to use an already existing directory as the output path, my daily script writes to /tmp/. Is there a way to combine /tmp/ with /output/, under the right 'type' subdirectories?

I expect /tmp/type1/file to end up under /output/type1/ as /output/type1/file, and so on. This way I can delete /tmp and run the script again.

Any help is appreciated. Thanks in advance.

Solution

Pig cannot manipulate directories except by invoking fs commands. Mapping the temporary directories onto the final directories requires more than Pig can do. You can use the FileSystem API in a tiny Java program and run it separately or as a step in an Oozie workflow.
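
As a rough illustration of the FileSystem API approach, here is a minimal sketch, not a definitive implementation. The class name MergeTypeDirs and the helper merge are hypothetical, and the source and destination roots (/tmp and /output in the question) are assumed to be passed as arguments. It moves each file from <src>/<type>/ into <dst>/<type>/, creates the destination directory when the type is new, and refuses to overwrite a file that already exists.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper class; the name is an assumption, not part of Pig or Hadoop.
public class MergeTypeDirs {

    /** Moves every file in src/<type>/ to dst/<type>/ and returns the number of files moved. */
    static int merge(FileSystem fs, Path src, Path dst) throws IOException {
        int moved = 0;
        for (FileStatus typeDir : fs.listStatus(src)) {
            if (!typeDir.isDirectory()) {
                continue; // skip plain files at the top level, e.g. a _SUCCESS marker
            }
            Path targetDir = new Path(dst, typeDir.getPath().getName());
            fs.mkdirs(targetDir); // creates the directory if this type is new
            for (FileStatus file : fs.listStatus(typeDir.getPath())) {
                if (!file.isFile()) {
                    continue; // only move regular files
                }
                Path target = new Path(targetDir, file.getPath().getName());
                if (fs.exists(target)) {
                    throw new IOException("Refusing to overwrite " + target);
                }
                if (!fs.rename(file.getPath(), target)) {
                    throw new IOException("Could not move " + file.getPath() + " to " + target);
                }
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) throws IOException {
        // Picks up the cluster settings (fs.defaultFS etc.) from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        int moved = merge(fs, new Path(args[0]), new Path(args[1]));
        System.out.println("Moved " + moved + " file(s)");
    }
}
```

Assuming the class is packaged into a jar (say merge.jar, a hypothetical name), it could be run after the Pig job with something like "hadoop jar merge.jar MergeTypeDirs /tmp /output"; once it succeeds, /tmp can be deleted before the next run. The overwrite check is why the files written on different days need distinct names.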

In addition, you need to ensure that the appended files have different filenames from the existing ones. This is not the default behaviour, but you can achieve it with these commands:

 %declare timestamp `date +"%s"` 
 SET mapreduce.output.basename '$timestamp'
 /* the timestamp is used here to make the output file names unique */