regex vba

Why does the regex match words with only letters, when it should match words with both letters AND digits?


BACKGROUND
I want to replace every word that consists only of letters and digits, and that contains at least one letter and at least one digit, with whitespace. I am using VBA, as shown in the example below.

See the proposed solution here: https://stackoverflow.com/a/7684859

QUESTION
Why does the regexp match the word "WhyIsThisWordMatched" when it doesn't contain a digit? And how can the regexp be fixed so it only matches words that contain both letters and digits, and only letters and digits?

Public Sub TestMe()
    Dim Rx As Object
    Dim Txt As String

    Set Rx = CreateObject("VBScript.RegExp")
    Rx.Global = True
    Rx.Pattern = "(^|\s)(?=.*[0-9])(?=.*[a-zA-Z])([a-zA-Z0-9]+)($|\s)"
    Txt = "WhyIsThisWordMatched XXX-111"
    Txt = Rx.Replace(Txt, " ")
    Debug.Print "Result: " & Txt
    ' Prints the string "  XXX-111"
End Sub

Solution

  • Lookaheads do not consume characters; they only assert whether a match is possible from the current position. The pattern (?=.*[0-9]) ensures that there is a digit somewhere ahead, and (?=.*[a-zA-Z]) ensures that there is a letter somewhere ahead. Because .* can run past the end of the current word to the end of the string, neither lookahead ensures that the digit and the letter are in the same word.

    The current pattern (^|\s)(?=.*[0-9])(?=.*[a-zA-Z])([a-zA-Z0-9]+)($|\s) is matching any word of letters and digits that follows a space or the start of the string and precedes a space or the end of the string, as long as there is a digit somewhere and a letter somewhere in the string (not necessarily in the same word).
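
    To see that the digit requirement is being satisfied by a different word, you can test the original pattern against two inputs. This is only a minimal sketch: the helper name MatchesOriginal and the first sample string are made up for illustration.

    ```vba
    ' Hypothetical helper: True if the question's original pattern finds a match in Txt.
    Public Function MatchesOriginal(ByVal Txt As String) As Boolean
        Dim Rx As Object
        Set Rx = CreateObject("VBScript.RegExp")
        Rx.Pattern = "(^|\s)(?=.*[0-9])(?=.*[a-zA-Z])([a-zA-Z0-9]+)($|\s)"
        MatchesOriginal = Rx.Test(Txt)
    End Function

    Public Sub ShowLookaheadScope()
        ' No digit anywhere in the string, so (?=.*[0-9]) fails at every position.
        Debug.Print MatchesOriginal("WhyIsThisWordMatched XXX")      ' False
        ' The digits in "XXX-111" let (?=.*[0-9]) succeed at the start of the string.
        Debug.Print MatchesOriginal("WhyIsThisWordMatched XXX-111")  ' True
    End Sub
    ```

    The only difference between the two inputs is the digits in the second word, yet that is what lets the first word match.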

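    Replacing .* with \w* confines each lookahead to the current word, because \w* cannot cross whitespace or punctuation. This pattern should resolve the immediate issue: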

    (^|\s)(?=\w*[0-9])(?=\w*[a-zA-Z])[a-zA-Z0-9]+($|\s)
    

    So your code would be something like:

    Public Sub TestMe()
        Dim Rx As Object
        Dim Txt As String
    
        Set Rx = CreateObject("VBScript.RegExp")
        Rx.Global = True
        Rx.Pattern = "(^|\s)(?=\w*[0-9])(?=\w*[a-zA-Z])[a-zA-Z0-9]+($|\s)"
        Txt = "WhyIsThisWordMatched XXX-111 abc123"
        Txt = Rx.Replace(Txt, " ")
        Debug.Print "Result: " & Txt
        ' Prints the string "WhyIsThisWordMatched XXX-111 "
    End Sub
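
    One thing to watch for with this pattern: the trailing ($|\s) consumes the whitespace after a match, so a second matching word that immediately follows has no leading whitespace left to match against. A possible variation, sketched here, swaps the trailing group for a lookahead (?=\s|$) and confines the lookaheads to [a-zA-Z0-9]*. The function names and the sample string are assumptions for illustration, not part of the original answer.

    ```vba
    ' Hypothetical helper: replace every match of Pat in Txt with a single space.
    Private Function ReplaceWithSpace(ByVal Pat As String, ByVal Txt As String) As String
        Dim Rx As Object
        Set Rx = CreateObject("VBScript.RegExp")
        Rx.Global = True
        Rx.Pattern = Pat
        ReplaceWithSpace = Rx.Replace(Txt, " ")
    End Function

    ' Pattern from the answer: the trailing whitespace is part of the match.
    Public Function ReplaceConsuming(ByVal Txt As String) As String
        ReplaceConsuming = ReplaceWithSpace("(^|\s)(?=\w*[0-9])(?=\w*[a-zA-Z])[a-zA-Z0-9]+($|\s)", Txt)
    End Function

    ' Variation: the trailing whitespace is only asserted, not consumed.
    Public Function ReplaceLookahead(ByVal Txt As String) As String
        ReplaceLookahead = ReplaceWithSpace("(^|\s)(?=[a-zA-Z0-9]*[0-9])(?=[a-zA-Z0-9]*[a-zA-Z])[a-zA-Z0-9]+(?=\s|$)", Txt)
    End Function

    Public Sub CompareAdjacentWords()
        Debug.Print "[" & ReplaceConsuming("a1 b2 xyz") & "]"   ' [ b2 xyz]  (b2 survives)
        Debug.Print "[" & ReplaceLookahead("a1 b2 xyz") & "]"   ' [   xyz]   (both replaced)
    End Sub
    ```

    With the lookahead variation each replaced word leaves its own space behind, so you may want to collapse runs of spaces afterwards.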
    

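    You could also delimit the word with negative lookarounds instead of consuming the surrounding whitespace: (?<!\S) asserts that the word is not preceded by a non-whitespace character, and (?!\S) asserts that it is not followed by one. Note that VBScript.RegExp does not support lookbehind, so the (?<!\S) part only works in engines that do (for example .NET or PCRE). The character class is [a-zA-Z0-9] rather than \S so that a token such as XXX-111 is still rejected:

    (?<!\S)(?=[a-zA-Z0-9]*[0-9])(?=[a-zA-Z0-9]*[a-zA-Z])[a-zA-Z0-9]+(?!\S)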
    

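    If the words do not have to be delimited by whitespace, a word boundary is simpler and works in VBScript.RegExp. Be aware that \b also matches next to punctuation, so abc1 in abc1-xyz would match, and that \w includes the underscore: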

    \b(?=\w*[0-9])(?=\w*[a-zA-Z])\w+\b
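
    To check what the word-boundary version picks up, you can list its matches on a string that mixes punctuation and underscores. This is a rough sketch: ListMatches and the sample string are hypothetical names and data chosen for illustration.

    ```vba
    ' Hypothetical helper: return every match of Pat in Txt, each wrapped in brackets.
    Public Function ListMatches(ByVal Pat As String, ByVal Txt As String) As String
        Dim Rx As Object
        Dim M As Object
        Dim Out As String

        Set Rx = CreateObject("VBScript.RegExp")
        Rx.Global = True
        Rx.Pattern = Pat
        For Each M In Rx.Execute(Txt)
            Out = Out & "[" & M.Value & "]"
        Next M
        ListMatches = Out
    End Function

    Public Sub ShowWordBoundaryMatches()
        ' "abc1" matches because "-" counts as a word boundary,
        ' and "a_1" matches because \w includes the underscore.
        Debug.Print ListMatches("\b(?=\w*[0-9])(?=\w*[a-zA-Z])\w+\b", "abc1-xyz a_1 XXX-111")
        ' Prints [abc1][a_1]
    End Sub
    ```

    If either of those matches is unwanted, the whitespace-delimited pattern above is the safer choice.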