javascriptregexstring-parsingnull-character

Tolerate certain characters in RegEx


I am writing a message formatting parser that has the capability (among others) to parse links. This specific case requires parsing a link in the from of <url|linkname> and replacing that text with just the linkname. The issue here is that both url or linkname may or may not contain \1 or \2 characters anywhere in any order (at most one of each though). I want to match the pattern but keep the "invalid" characters. This problem solves itself for linkname as that part of the pattern is just ([^\n+]), but the url fragment is matched by a much more complicated pattern, more specifically the URL validation pattern from is.js. It would not be trivial to modify the whole pattern manually to tolerate [\1\2] everywhere, and I need the pattern to preserve those characters as they are used for tracking purposes (so I can't simply just .replace(/\1|\2/g, "") before matching).

If this kind of matching is not possible, is there some automated way to reliably modify the RegExp to add [\1\2]{0,2} between every character match, add \1\2 to all [chars] matches, etc.

This is the url pattern taken from is.js:

/(?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?/i

This pattern was adapted for my purposes and for the <url|linkname> format as follows:

let namedUrlRegex = /<((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)\|([^\n]+)>/ig;

The code where this is used is here: JSFiddle

Examples for clarification (... represents the namedUrlRegex variable from above, and $2 is the capture group that captures linkname):

Current behavior:
"<googl\1e.com|Google>".replace(..., "$2") // "<googl\1e.com|Google>" WRONG
"<google.com|Goo\1gle>".replace(..., "$2") // "Goo\1gle"              CORRECT
"<not_\1a_url|Google>".replace(..., "$2") // "<not_\1a_url|Google>"   CORRECT

Expected behavior:
"<googl\1e.com|Google>".replace(..., "$2") // "Google" (note there is no \1)
"<google.com|Goo\1gle>".replace(..., "$2") // "Goo\1gle"
"<not_\1a_url|Google>".replace(..., "$2") // "<not_\1a_url|Google>"

Note the same rules for \1 apply to \2, \1\2, \1...\2, \2...\1 etc

Context: This is used to normalize a string from a WYSIWYG editor to the length/content that it will display as, preserving the location of the current selection (denoted by \1 and \2 so it can be restored after parsing). If the "caret" is removed completely (e.g. if the cursor was in the URL of a link), it will select the whole string instead. Everything works as expected, except for when the selection starts or ends in the url fragment.

Edit for clarification: I only want to change a segment in a string if it follows the format of <url|linkname> where url matches the URL pattern (tolerating \1, \2) and linkname consists of non-\n characters. If this condition is not met within a <...|...> string, it should be left unaltered as per the not_a_url example above.


Solution

  • I ended up making a RegEx that matches all "symbols" in the expression. One quirk of this is that it expects :, =, ! characters to be escaped, even outside of a (?:...), (?=...), (?!...) expression. This is addressed by escaping them before processing.

    Fiddle

    let r = /(\\.|\[.+?\]|\w|[^\\\/\[\]\^\$\(\)\?\*\+\{\}\|\+\:\=\!]|(\{.+?\}))(?:((?:\{.+?\}|\+|\*)\??)|\??)/g;
    
    let url = /((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)/
    
    function tolerate(regex, insert) {
        let first = true;
            // convert to string
        return regex.toString().replace(/\/(.+)\//, "$1").
            // escape :=!
            replace(/((?:^|[^\\])\\(?:\\)*\(\?|[^?])([:=!]+)/g, (m, g1, g2) => g1 + (g2.split("").join("\\"))).
            // substitute string
            replace(r, function(m, g1, g2, g3, g4) {
                // g2 = {...} multiplier (to prevent matching digits as symbols)
                if (g2) return m;
                // g3 = multiplier after symbol (must wrap in parenthesis to preserve behavior)
                if (g3) return "(?:" + insert + g1 + ")" + g3;
                // prevent matching tolerated characters at beginning, remove to change this behavior
                if (first) {
                    first = false;
                    return m;
                }
                // insert the insert
                return insert + m;
            }
        );
    }
    
    alert(tolerate(url, "\1?\2?"));