java regex hadoop apache-pig string-matching

Pig script/command to filter a file on multiple strings


I am trying to write a Java program or Hadoop Pig script that takes a parameter of comma-separated strings (e.g. abc, def, xyz) and filters a file for the records that contain one or more of these strings.

E.g.

Input File:

1    abctree
2    pqrwewe
3    rtrxyz45
4    abcxyz
5    234rt23

Input parameter is: abc, def, xyz

Expected output:

1    abctree
3    rtrxyz45
4    abcxyz

I am able to write a script that filters the file on one string, using matches, but I don't know how to do that for multiple strings. Do I need to write a UDF for this?

I have added the Java tag to this question because, as per my initial findings, I will have to write a UDF, which would be written in Java. So if anyone knows a way to write this in Java, please post your solution.


Solution

  • I have figured it out:

    B = filter A by (n matches '.*string1.*' or n matches '.*string2.*' or n matches '.*string3.*');
    

    This does the trick.

    However, my requirement is to accept comma-separated input from the command line, e.g. string1, string2, string3. So the next task is to split that input into the individual strings and use them in the above expression. If anyone knows how to do it (especially without UDFs), please post.
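One possible way to handle the comma-separated input is to fold it into a single regular expression, since the chain of `or`-ed `matches` tests above is equivalent to one alternation such as `.*(string1|string2|string3).*`. The following is only a minimal sketch in plain Java, not a Pig UDF; the class name `MultiStringFilter` and the methods `buildPattern` and `filter` are hypothetical names chosen for illustration. It assumes the search terms should be treated as literal text (hence `Pattern.quote`) and that a record matches if the term appears anywhere in the line, including the id column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: names are illustrative, not part of Pig or Hadoop.
class MultiStringFilter {

    // Turn "abc, def, xyz" into a whole-line regex of the form
    // .*(\Qabc\E|\Qdef\E|\Qxyz\E).*
    // Each term is trimmed and quoted so regex metacharacters are literal.
    static String buildPattern(String csv) {
        String alternation = Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(term -> !term.isEmpty())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            // An empty alternation would match every line, so reject it.
            throw new IllegalArgumentException("no search terms given");
        }
        return ".*(" + alternation + ").*";
    }

    // Keep only the lines that contain at least one of the terms.
    static List<String> filter(List<String> lines, String csv) {
        Pattern pattern = Pattern.compile(buildPattern(csv));
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // matches() tests the whole line, like Pig's matches operator.
            if (pattern.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "1    abctree",
                "2    pqrwewe",
                "3    rtrxyz45",
                "4    abcxyz",
                "5    234rt23");
        for (String line : filter(input, "abc, def, xyz")) {
            System.out.println(line);
        }
    }
}
```

With the sample file from the question, this keeps records 1, 3 and 4. If the terms are plain alphanumeric strings, the quoting could be dropped and the resulting pattern (e.g. `.*(abc|def|xyz).*`) might be handed to the Pig script through parameter substitution, for instance `pig -param pattern='.*(abc|def|xyz).*'` with `n matches '$pattern'` in the filter; that route avoids a UDF, but the pattern would still have to be built outside Pig, for example in a small wrapper script.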