[SOLVED] Pig script/command to filter a file on multiple strings

I am trying to write a Java program or Hadoop Pig script that takes a parameter of comma-separated strings (e.g. abc, def, xyz) and filters a file for the records that contain one or more of these strings.

E.g.

Input File:

1    abctree
2    pqrwewe
3    rtrxyz45
4    abcxyz
5    234rt23

Input parameter is: abc, def, xyz

Expected output:

1    abctree
3    rtrxyz45
4    abcxyz

I am able to write a script that filters the file on a single string using matches, but I don't know how to do that for multiple strings. Do I need to write a UDF for this?

I have added the Java tag to this question because, from my initial findings, I may have to write a UDF, and that UDF would be written in Java. So if anyone knows a way to do this in Java, please post your solution.

Solution

I have figured it out:

B = filter A by (n matches '.*string1.*' or n matches '.*string2.*' or n matches '.*string3.*');

This does the trick.

However, for my requirement, I will be accepting comma-separated input from the command line, e.g. string1, string2, string3. So the next task is to split that input into individual strings and use them in the above expression. If anyone knows how to do this (especially without UDFs), please post.
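
For the Java side of the question, here is a minimal sketch of one way to split the comma-separated parameter and fold the pieces into a single regular expression of the form .*(abc|def|xyz).*, which is equivalent to chaining the individual matches conditions with or. The class name MultiStringFilter and the method names buildPattern and filter are made up for illustration, and the sketch tests the whole line rather than a single field like n. It is not a Pig UDF, only plain Java that shows the splitting and matching logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class; the names are illustrative and not part of Pig or Hadoop.
public class MultiStringFilter {

    // Turns "abc, def, xyz" into a pattern equivalent to .*(abc|def|xyz).*
    public static Pattern buildPattern(String commaSeparated) {
        String alternation = Arrays.stream(commaSeparated.split(","))
                .map(String::trim)            // drop the spaces around each term
                .filter(s -> !s.isEmpty())    // ignore empty terms such as "a,,b"
                .map(Pattern::quote)          // treat each term as literal text
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            throw new IllegalArgumentException("no search strings given");
        }
        return Pattern.compile(".*(?:" + alternation + ").*");
    }

    // Keeps only the lines that contain at least one of the search strings.
    public static List<String> filter(List<String> lines, Pattern pattern) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // matches() must cover the whole line, hence the leading and trailing .*
            if (pattern.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumption: the comma-separated list arrives as the first argument.
        String param = args.length > 0 ? args[0] : "abc, def, xyz";
        List<String> input = Arrays.asList(
                "1    abctree",
                "2    pqrwewe",
                "3    rtrxyz45",
                "4    abcxyz",
                "5    234rt23");
        filter(input, buildPattern(param)).forEach(System.out::println);
    }
}
```

With the sample input from the question and the parameter abc, def, xyz, this keeps records 1, 3 and 4. Pattern.quote is used so that terms containing regex metacharacters are matched literally. If the goal is to stay in Pig without a UDF, the same alternation idea might carry over by passing an already-built pattern such as .*(abc|def|xyz).* to the script through Pig's parameter substitution, but I have not verified how quoting and escaping behave there, so treat that as an assumption to test.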