regex algorithm string-matching

Matching algorithm or regular expression?


I have a huge log file with different types of string rows, and I need to extract data in a "smart" way from these.

Sample snippet:

2011-03-05 node32_three INFO stack trace, at empty string asfa 11120023
--- - MON 23 02 2011 ERROR stack trace NONE      

For instance, what is the best way to extract the date from each row, independent of date format?


Solution

  • You could make a regex for different formats like so:

     (fmt1)|(fmt2)|....
    

    Where fmt1, fmt2, etc. are the individual regexes. For your example:

    (20\d\d-[01]\d-[0123]\d)|((?:MON|TUE|WED|THU|FRI|SAT|SUN) [0123]\d [01]\d 20\d\d)
    

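    Note that, to reduce the chance of matching arbitrary numbers, I restricted the year, month and day digits accordingly. For example, a day number cannot start with 4, nor can a month number start with 2. The check is still loose: [01]\d also accepts 00 or 19 as a month, so validate the extracted value afterwards if that matters.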

    This gives the following pseudo code:

    // remember that you need to double each backslash when writing the
    // pattern as a Java string literal
    Pattern p = Pattern.compile("...");    // compile once, reuse for every line
    String s;
    for each line 
        s = current input line;
        Matcher m = p.matcher(s);
        if (m.find()) {
            String d = m.group();    // d is the substring that matched
            ....
        }
    

    Each individual date pattern is wrapped in its own capturing group () so that you can find out which format matched, like so:

            int fmt = 0;
            // each (fmt) is a group, numbered starting with 1 from left to right;
            // m.groupCount() equals the number of formats as long as they are the only capturing groups
            for (int i = 1; fmt == 0 && i <= m.groupCount(); i++) 
                if (m.group(i) != null) fmt = i;
    

    For this to work, inner (regex) groups must be written as (?:regex) so that they do not count as capturing groups; see the updated example above.
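
Putting the pieces above together, here is a minimal runnable sketch in Java. The class name DateExtractor and the helper methods extractDate and detectFormat are made up for illustration, and the pattern covers only the two formats from the sample snippet, so treat it as a starting point rather than a complete solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class DateExtractor {
    // One capturing group per date format. The weekday alternation uses a
    // non-capturing group (?:...) so the group numbers line up with the formats.
    // Backslashes are doubled because the pattern is a Java string literal.
    private static final Pattern DATE = Pattern.compile(
        "(20\\d\\d-[01]\\d-[0123]\\d)"
        + "|((?:MON|TUE|WED|THU|FRI|SAT|SUN) [0123]\\d [01]\\d 20\\d\\d)");

    /** Returns the first date-like substring of the line, or null if there is none. */
    static String extractDate(String line) {
        Matcher m = DATE.matcher(line);
        return m.find() ? m.group() : null;
    }

    /** Returns the 1-based number of the format that matched, or 0 if no date was found. */
    static int detectFormat(String line) {
        Matcher m = DATE.matcher(line);
        if (!m.find()) {
            return 0;
        }
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {
                return i;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String[] lines = {
            "2011-03-05 node32_three INFO stack trace, at empty string asfa 11120023",
            "--- - MON 23 02 2011 ERROR stack trace NONE"
        };
        for (String line : lines) {
            System.out.println(detectFormat(line) + ": " + extractDate(line));
        }
    }
}
```

For a real log file you would read the lines from a BufferedReader instead of an array, and add one more capturing group to the pattern for each additional date format you come across.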