I am trying to write a Java program or Hadoop Pig script that takes a comma-separated list of strings as a parameter (e.g. abc, def, xyz) and filters a file for the records that contain one or more of these strings.
E.g.
Input File:
1 abctree
2 pqrwewe
3 rtrxyz45
4 abcxyz
5 234rt23
Input parameter is: abc, def, xyz
Expected output:
1 abctree
3 rtrxyz45
4 abcxyz
I can write a script that filters the file on a single string using matches, but I don't know how to do it for multiple strings. Do I need to write a UDF for this?
I have added the Java tag to this question because, from what I have found so far, I will have to write a UDF, and that would be written in Java. So if anyone knows a way to write this in Java, please post your solution.
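For the plain-Java route, a minimal sketch of the filtering logic might look like the following. It assumes each record is one line of text and that a keyword counts as a match if it appears anywhere in the line (including the leading record number). The class and method names (KeywordFilter, parseKeywords, filter) are made up for illustration. This is not a Pig UDF, only the core logic such a UDF or a standalone program could reuse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeywordFilter {

    // Splits "abc, def, xyz" into [abc, def, xyz], trimming spaces and skipping empty entries.
    public static List<String> parseKeywords(String csv) {
        List<String> keywords = new ArrayList<>();
        for (String part : csv.split(",")) {
            String keyword = part.trim();
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return keywords;
    }

    // Keeps every line that contains at least one of the keywords (assumption: plain substring match).
    public static List<String> filter(List<String> lines, String csv) {
        List<String> keywords = parseKeywords(csv);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            for (String keyword : keywords) {
                if (line.contains(keyword)) {
                    kept.add(line);
                    break; // one hit is enough, avoid adding the line twice
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample records from the question; a real program would read them from the input file.
        List<String> input = Arrays.asList(
                "1 abctree", "2 pqrwewe", "3 rtrxyz45", "4 abcxyz", "5 234rt23");
        for (String line : filter(input, "abc, def, xyz")) {
            System.out.println(line);
        }
    }
}
```

With the sample input and the parameter abc, def, xyz, this keeps records 1, 3 and 4.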
I have figured it out:
B = filter A by (n matches '.*string1.*' or n matches '.*string2.*' or n matches '.*string3.*');
This does the trick.
However, my requirement is to accept comma-separated input from the command line, e.g. string1, string2, string3. So the next task is to split that input into individual strings and use them in the expression above. If anyone knows how to do this (especially without UDFs), please post.
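One possible direction, sketched here in Java: since matches takes a regular expression, the separate or clauses could be collapsed into a single alternation such as '.*(string1|string2|string3).*', and that one string could be built from the comma-separated input before the script runs. The sketch assumes the keywords contain no regex metacharacters, and the class and method names (PatternBuilder, toRegex) are hypothetical. Whether the resulting string is then handed to the script through Pig's parameter substitution or some other wrapper is left open.

```java
import java.util.ArrayList;
import java.util.List;

public class PatternBuilder {

    // Turns "abc, def, xyz" into ".*(abc|def|xyz).*".
    // Assumption: keywords are plain text with no regex metacharacters, so no escaping is done.
    public static String toRegex(String csv) {
        List<String> keywords = new ArrayList<>();
        for (String part : csv.split(",")) {
            String keyword = part.trim();
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        if (keywords.isEmpty()) {
            // An empty alternation would match every record, so reject it explicitly.
            throw new IllegalArgumentException("no keywords given");
        }
        return ".*(" + String.join("|", keywords) + ").*";
    }

    public static void main(String[] args) {
        String csv = args.length > 0 ? args[0] : "abc, def, xyz";
        String regex = toRegex(csv);
        System.out.println(regex);
        // String.matches requires the whole string to match, hence the leading and trailing ".*".
        System.out.println("3 rtrxyz45".matches(regex));
        System.out.println("2 pqrwewe".matches(regex));
    }
}
```

The design choice here is to keep the filter expression fixed at a single matches test and move all the variability into one generated string, which avoids needing a variable number of or clauses.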