I am writing a message formatting parser that can, among other things, parse links. This specific case requires parsing a link in the form of <url|linkname> and replacing that text with just the linkname. The issue is that both url and linkname may or may not contain \1 or \2 characters anywhere, in any order (at most one of each, though). I want to match the pattern but keep the "invalid" characters. This problem solves itself for linkname, as that part of the pattern is just ([^\n]+), but the url fragment is matched by a much more complicated pattern, specifically the URL validation pattern from is.js. It would not be trivial to modify the whole pattern manually to tolerate [\1\2] everywhere, and I need the pattern to preserve those characters because they are used for tracking purposes (so I can't simply .replace(/\1|\2/g, "") before matching).
If this kind of matching is not possible, is there some automated way to reliably modify the RegExp to add [\1\2]{0,2} between every character match, add \1\2 to all [chars] matches, etc.?
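To make the requested transformation concrete, here is a minimal sketch for the simplest possible case only: a pattern made entirely of literal characters. The helper name tolerantLiteral is hypothetical, it does not handle groups, character classes, or quantifiers, and \x01 and \x02 are the same characters written as \1 and \2 elsewhere in this post.

```javascript
// Hypothetical helper (not from is.js): builds a RegExp that matches
// `literal` while allowing up to two marker characters (\x01, \x02)
// between any two of its characters.
function tolerantLiteral(literal) {
  const markers = "[\\x01\\x02]{0,2}";
  // Escape each character so it is matched literally.
  const escaped = Array.from(literal, (ch) =>
    ch.replace(/[.*+?^${}()|[\]\\\/]/g, "\\$&")
  );
  // Put the optional-marker pattern between every pair of characters.
  return new RegExp(escaped.join(markers));
}

const re = tolerantLiteral("google.com");
console.log(re.test("googl\x01e.com"));     // true
console.log(re.test("goo\x02gle\x01.com")); // true
console.log(re.test("googleXcom"));         // false
```

Doing the same for an arbitrary pattern is harder because the insertion has to respect groups, classes, and quantifiers, which is what the question is really about.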
This is the url pattern taken from is.js:
/(?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?/i
This pattern was adapted for my purposes and for the <url|linkname> format as follows:
let namedUrlRegex = /<((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)\|([^\n]+)>/ig;
The code where this is used is here: JSFiddle
Examples for clarification (... represents the namedUrlRegex variable from above, and $2 is the capture group that captures linkname):
Current behavior:
"<googl\1e.com|Google>".replace(..., "$2") // "<googl\1e.com|Google>" WRONG
"<google.com|Goo\1gle>".replace(..., "$2") // "Goo\1gle" CORRECT
"<not_\1a_url|Google>".replace(..., "$2") // "<not_\1a_url|Google>" CORRECT
Expected behavior:
"<googl\1e.com|Google>".replace(..., "$2") // "Google" (note there is no \1)
"<google.com|Goo\1gle>".replace(..., "$2") // "Goo\1gle"
"<not_\1a_url|Google>".replace(..., "$2") // "<not_\1a_url|Google>"
Note the same rules for \1 apply to \2, \1\2, \1...\2, \2...\1, etc.
Context: This is used to normalize a string from a WYSIWYG editor to the length/content that it will display as, preserving the location of the current selection (denoted by \1 and \2 so it can be restored after parsing). If the "caret" is removed completely (e.g. if the cursor was in the URL of a link), it will select the whole string instead. Everything works as expected, except for when the selection starts or ends in the url fragment.
Edit for clarification: I only want to change a segment in a string if it follows the format of <url|linkname> where url matches the URL pattern (tolerating \1, \2) and linkname consists of non-\n characters. If this condition is not met within a <...|...> string, it should be left unaltered as per the not_a_url example above.
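The rule above can also be read as a two-step check instead of a single tolerant pattern: match <left|right> loosely, strip the markers from the left part only for validation, and replace only if what remains is a URL. The sketch below is an illustration of that reading, not the approach used in this post. The names isUrl and replaceNamedLinks are hypothetical, isUrl uses a deliberately simplified stand-in rather than the is.js pattern, and the loose matcher assumes the link name contains no < or > characters.

```javascript
// Stand-in URL check for the demo only (assumption). In practice this
// would be the is.js pattern anchored with ^ and $.
const isUrl = (s) =>
  /^(?:(?:https?|ftp):\/\/)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}(?:\/\S*)?$/i.test(s);

// Hypothetical helper: the markers (\x01, \x02) are removed from the url
// part only while validating; the link name is returned untouched.
function replaceNamedLinks(text) {
  return text.replace(/<([^|\n<>]+)\|([^\n<>]+)>/g, (match, url, name) =>
    isUrl(url.replace(/[\x01\x02]/g, "")) ? name : match
  );
}

console.log(replaceNamedLinks("<googl\x01e.com|Google>") === "Google");                 // true
console.log(replaceNamedLinks("<google.com|Goo\x01gle>") === "Goo\x01gle");             // true
console.log(replaceNamedLinks("<not_\x01a_url|Google>") === "<not_\x01a_url|Google>");  // true
```

Because the original string is never modified before matching, markers in the link name and in non-matching segments survive, while markers inside a valid url disappear along with the url itself.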
I ended up making a RegEx that matches all "symbols" in the expression. One quirk of this is that it expects :, =, ! characters to be escaped, even outside of a (?:...), (?=...), (?!...) expression. This is addressed by escaping them before processing.
let r = /(\\.|\[.+?\]|\w|[^\\\/\[\]\^\$\(\)\?\*\+\{\}\|\+\:\=\!]|(\{.+?\}))(?:((?:\{.+?\}|\+|\*)\??)|\??)/g;
let url = /((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)/
function tolerate(regex, insert) {
    let first = true;
    // convert to string (strip the surrounding slashes)
    return regex.toString().replace(/\/(.+)\//, "$1").
        // escape :=! unless they directly follow the ? of (?: (?= (?!
        replace(/((?:^|[^\\])\\(?:\\)*\(\?|[^?])([:=!]+)/g, (m, g1, g2) => g1 + "\\" + g2.split("").join("\\")).
        // substitute string
        replace(r, function(m, g1, g2, g3, g4) {
            // g2 = {...} multiplier (to prevent matching digits as symbols)
            if (g2) return m;
            // g3 = multiplier after symbol (must wrap in parentheses to preserve behavior)
            if (g3) return "(?:" + insert + g1 + ")" + g3;
            // prevent matching tolerated characters at beginning, remove to change this behavior
            if (first) {
                first = false;
                return m;
            }
            // insert the insert
            return insert + m;
        });
}
alert(tolerate(url, "\1?\2?"));
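Two details of the code above may be easier to see on a tiny pattern. The sketch below first shows which "symbols" the expression r picks out of a short pattern string, then shows why a quantified symbol is wrapped in (?:...) before the insert is added. The sample pattern and the names tokens, naive, and wrapped are made up for illustration, and \x01 and \x02 are the same characters as \1 and \2 above.

```javascript
// Same tokenizer expression as `r` in the post.
const r = /(\\.|\[.+?\]|\w|[^\\\/\[\]\^\$\(\)\?\*\+\{\}\|\+\:\=\!]|(\{.+?\}))(?:((?:\{.+?\}|\+|\*)\??)|\??)/g;

// Each match is one symbol together with its quantifier, if it has one.
const tokens = "ab+[cd]\\d{2}".match(r);
console.log(tokens); // four tokens: a, b+, [cd], \d{2}

const insert = "\\x01?\\x02?";
// Naive: prefixing the symbol leaves the quantifier bound to `b` alone,
// so markers are only tolerated before the first b.
const naive = new RegExp("^a" + insert + "b+$");
// Wrapped: the quantifier repeats "optional markers, then b" as a unit.
const wrapped = new RegExp("^a(?:" + insert + "b)+$");

console.log(naive.test("ab\x01b"));   // false
console.log(wrapped.test("ab\x01b")); // true
```

The string returned by tolerate is a pattern source, so it would still need to be passed to new RegExp (with the desired flags) before it can be used for matching.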