python regex

Python regex for Java annotations


I'm trying to detect valid Java annotations in a text. Here's my test program (I'm ignoring all whitespace for now to keep things simple; I'll add whitespace handling later):

txts = ['@SomeName2',                   # match
        '@SomeName2(',                  # no match
        '@SomeName2)',                  # no match 
        '@SomeName2()',                 # match
        '@SomeName2()()',               # no match
        '@SomeName2(value)',            # no match
        '@SomeName2(=)',                # no match
        '@SomeName2("")',               # match
        '@SomeName2(".")',              # no match
        '@SomeName2(",")',              # match
        '@SomeName2(value=)',           # no match
        '@SomeName2(value=")',          # no match
        '@SomeName2(=3)',               # no match
        '@SomeName2(="")',              # no match
        '@SomeName2(value=3)',          # match
        '@SomeName2(value=3L)',         # match
        '@SomeName2(value="")',         # match
        '@SomeName2(value=true)',       # match
        '@SomeName2(value=false)',      # match
        '@SomeName2(value=".")',        # no match
        '@SomeName2(value=",")',        # match
        '@SomeName2(x="o_nbr ASC, a")', # match

        # multiple params:
        '@SomeName2(,value="ord_nbr ASC, name")',                            # no match
        '@SomeName2(value="ord_nbr ASC, name",)',                            # no match
        '@SomeName2(value="ord_nbr ASC, name"insertable=false)',             # no match
        '@SomeName2(value="ord_nbr ASC, name",insertable=false)',            # match
        '@SomeName2(value="ord_nbr ASC, name",insertable=false,length=10L)', # match

        '@SomeName2 ( "ord_nbr ASC, name", insertable = false, length = 10L )',       # match
       ]


#regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?\))?$'
#regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?(,((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))*\))?$'

regex = r"""
    (?:@[a-z]\w*)                               # @ + identifier (class name)
    (
      \(                                        # opening parenthesis
        (
          (?:[a-z]\w*)                          # identifier (var name)
          =                                     # assignment operator
          (\d+l?|"(?:[a-z0-9_, ]*)"|true|false) # either a numeric | a quoted string containing only alphanumeric chars, _, comma, space | true | false
        )?                                      # optional assignment group
      \)                                        # closing parenthesis
    )?$                                         # optional parentheses group (zero or one)
    """


import re

rg = re.compile(regex, re.VERBOSE | re.IGNORECASE)

for txt in txts:
    m = rg.search(txt)
    #m = rg.match(txt)
    if m:
        # show the first two capturing groups: the parentheses group and the assignment group
        output = ''.join('[' + str(m.group(i + 1)) + ']' for i in range(2))
        print("MATCH:    " + output)
    else:
        print("NO MATCH: " + txt)

So basically what I have seems to work for zero or one parameters. Now I'm trying to extend the syntax to zero or more parameters, like in the last example.

I then copied the part of the regex that represents the assignment and prefixed it with a comma for the 2nd to nth group (this group now uses * instead of ?):

regex = '((?:@[a-z][a-z0-9_]*))(\((((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))?(,((?:[a-z][a-z0-9_]*))(=)(\d+l?|"(?:[a-z0-9_, ]*)"|true|false))*\))?$'

That cannot work, however. The problem seems to be how to handle the first element: because it must be optional, strings like the first multiple-params example, '@SomeName2(,value="ord_nbr ASC, name")', would be accepted, which is wrong. I have no idea how to make the 2nd to nth assignment depend on the presence of the first (optional) element.

Can it be done? Is it done that way? How do you best solve this?

Thanks


Solution

  • If you're just trying to detect valid syntax, I believe the regex below will give you the matches you want (compile it with re.IGNORECASE, as in your code). But I'm not sure what you are doing with the groups. Do you want each parameter value in its own group as well? That will be harder, and I'm not sure it's even possible with regex.

    regex = r'((?:@[a-z][a-z0-9_]*))(?:\((?!,)(?:(([a-z][a-z0-9_]*(=)(?:("[a-z0-9_, ]*")|(true|false)|(\d+l?))))(?!,\)),?)*\)(?!\()|$)'
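
    On the question of making the 2nd to nth assignment depend on the first: besides lookaheads, a common way to express "zero or more comma-separated items" is to nest the repeated ",item" group inside the optional group that holds the first item. The sketch below is an assumption about what you want, not the pattern above: the names PARAM, ANNOTATION and is_valid are made up for illustration, it keeps your requirement of name=value parameters (so a bare value such as ("") is not accepted), and it still ignores whitespace. Unlike an optional ",?", this shape also rejects a missing comma between two parameters.

```python
import re

# One parameter: name=value, using the value forms from the question
# (number with optional L suffix, restricted quoted string, true, false).
PARAM = r'[a-z]\w*=(?:\d+l?|"[a-z0-9_, ]*"|true|false)'

# Shape: @name, then optionally "(" [ param ( "," param )* ] ")".
# The ",param" repetition lives inside the optional first-param group,
# so a comma can only appear between two parameters.
ANNOTATION = re.compile(
    r'@[a-z]\w*(?:\((?:{p}(?:,{p})*)?\))?'.format(p=PARAM),
    re.IGNORECASE,
)


def is_valid(text):
    """Return True if the whole string is one annotation of the shape above."""
    return ANNOTATION.fullmatch(text) is not None


if __name__ == '__main__':
    for sample in ['@SomeName2',
                   '@SomeName2()',
                   '@SomeName2(,value="ord_nbr ASC, name")',
                   '@SomeName2(value="ord_nbr ASC, name",)',
                   '@SomeName2(value="ord_nbr ASC, name",insertable=false)']:
        print(sample, '->', is_valid(sample))
```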
    

    If you need the individual parameters/values, you probably need to write a real parser for that.

    EDIT: Here's a commented version of the regex. I also removed many of the capturing and non-capturing groups to make it easier to understand. If you compile it with re.VERBOSE and re.IGNORECASE and use it with re.findall(), it will return two groups per match: the function name, and all the params in parentheses:

    regex = r'''
    (@[a-z][a-z0-9_]*) # function name, captured in group
    (                  # open capture group for all parameters
    \(                 # opening function parenthesis 
      (?!,)            # negative lookahead for unwanted comma
      (?:              # open non-capturing group for all params
      [a-z][a-z0-9_]*  # parameter name
      =                # parameter assignment operator
      (?:"[a-z0-9_, ]*"|true|false|(?:\d+l?)) # possible parameter values
      (?!,\))          # negative lookahead for unwanted comma and closing parenthesis
      ,?               # optional comma, separating params
      )*               # close param non-capturing group, repeat zero or more times
    \)                 # closing function parenthesis 
    (?!\(\))           # negative lookahead for empty parentheses
    |$                 # OR end-of-line (in case there are no params)
    )                  # close capture group for all parameters
    '''
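
    For reference, here is a small sketch of how the commented pattern might be used with re.findall(). The pattern is the same as above with only the comment wording changed; the helper name find_annotations and the sample strings are assumptions for illustration. Note the two flags: re.VERBOSE so the whitespace and comments are ignored, and re.IGNORECASE because the character classes are lowercase only.

```python
import re

# Same pattern as the commented version in the answer.
regex = r'''
(@[a-z][a-z0-9_]*) # annotation name, captured in group 1
(                  # group 2: the whole parameter list
\(                 # opening parenthesis
  (?!,)            # negative lookahead for a leading comma
  (?:              # one parameter
  [a-z][a-z0-9_]*  # parameter name
  =                # assignment operator
  (?:"[a-z0-9_, ]*"|true|false|(?:\d+l?)) # possible parameter values
  (?!,\))          # negative lookahead for a comma right before ")"
  ,?               # optional comma separating params
  )*               # zero or more parameters
\)                 # closing parenthesis
(?!\(\))           # negative lookahead for empty parentheses
|$                 # OR end of string (no parameter list)
)
'''

pattern = re.compile(regex, re.VERBOSE | re.IGNORECASE)


def find_annotations(text):
    """Return a list of (name, params) tuples, one per match."""
    return pattern.findall(text)


if __name__ == '__main__':
    for sample in ['@SomeName2',
                   '@SomeName2(value="ord_nbr ASC, name",insertable=false)',
                   '@SomeName2(,value=3)']:
        print(sample, '->', find_annotations(sample))
```

    One thing to be aware of: because the separating comma is optional (",?"), this pattern also accepts two parameters with no comma between them, such as '@SomeName2(value="ord_nbr ASC, name"insertable=false)'. If that must be rejected, the comma has to be made mandatory between parameters.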
    

    After reading your comment about the parameters, the easiest thing will probably be to use the above regex to pull out all the parameters, then write another regex to pull out name/value pairs to do with as you wish. This will be tricky too, though, because there are commas in the parameter values. I'll leave that as an exercise for the reader :)
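
    For anyone who wants a starting point for that second step, here is a minimal sketch. It assumes the parameter list has already been validated and captured as one string, and that quoted values only contain the characters allowed earlier (letters, digits, underscore, comma, space). The names PAIR and parse_params are hypothetical. The commas inside values stop being a problem because each match consumes a whole quoted string in one go, so the scan never resumes in the middle of a value.

```python
import re

# One name=value pair. The quoted-string alternative swallows any commas
# inside the quotes, so they are never mistaken for separators.
PAIR = re.compile(
    r'([a-z]\w*)=("[a-z0-9_, ]*"|true|false|\d+l?)',
    re.IGNORECASE,
)


def parse_params(params):
    """Return the (name, value) pairs found in a parameter list, in order.

    `params` is the text captured for the parentheses group, for example
    '(value="a, b",insertable=false)'. Quoted values keep their quotes.
    """
    return PAIR.findall(params)


if __name__ == '__main__':
    captured = '(value="ord_nbr ASC, name",insertable=false,length=10L)'
    for name, value in parse_params(captured):
        print(name, '=', value)
```

    This only extracts pairs; it does not validate anything, so it should be run on strings the first regex has already accepted.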