It seems the regex engine used by Notepad++ can't do what I thought it would. Maybe a more general problem, not only with negation syntax.
Example with unicode number U+1F3B5, Unicode Name "Musical Note":
Successful regex WITHOUT negation that is already matching what I want:
videoPrimaryInfoRenderer":{"title":{"runs":\[{"text":"\K.+?(?=")
Example text that includes something I want to match:
}},"contents":{"twoColumnWatchNextResults":{"results":{"results":{"contents":[{"videoPrimaryInfoRenderer":{"title":{"runs":[{"text":"【みんなのリズム会場】ノリノリリズムパーティーはこちらです!🎵【天国】"}]},"viewCount":{"videoViewCountRenderer":
The part that I want to match (the regex above gets this):
【みんなのリズム会場】ノリノリリズムパーティーはこちらです!🎵【天国】
includes
🎵
emoji. So the "dot" DOES match characters above U+10000 in the last part of my regex:
.+?
and then the lookahead
(?=")
ends the match before the first ".
The match starts AFTER:
videoPrimaryInfoRenderer":{"title":{"runs":\[{"text":"\K
where the
\K
resets the start of the match, so that prefix is "forgotten" and only whatever I put at the end of the regex gets selected.
Example regex with negation I tried:
Example 1:
[^"]+
since there is no " in the part I want to match, matched:
【みんなのリズム会場】ノリノリリズムパーティーはこちらです!
Example 2:
((?!").)+
matched the same as example 1. Not surprising, since it's the same idea of excluding " but with negative lookahead.
Both types of "match OTHER THAN specified character" stop before the emoji.
Notepad++ v8.6.5 (32-bit)
I would appreciate an explanation.
TLDR:
You may use
videoPrimaryInfoRenderer":{"title":{"runs":\[{"text":"\K(?:(?!").[\x{DC00}-\x{DFFF}]?)+
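To see why the optional [\x{DC00}-\x{DFFF}]? after the dot matters, it can help to look at the text the way a UTF-16 based engine sees it. The following is only a minimal Python sketch, not Notepad++ itself, and the helper name utf16_units is made up for illustration. It lists the 16-bit code units of the Musical Note emoji and counts how many characters of the sample title need two units:

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

note = "\U0001F3B5"  # MUSICAL NOTE, U+1F3B5
print(" ".join(f"{u:04X}" for u in utf16_units(note)))  # D83C DFB5

title = "【みんなのリズム会場】ノリノリリズムパーティーはこちらです!🎵【天国】"
# Each character above U+FFFF contributes one extra code unit.
extra_units = len(utf16_units(title)) - len(title)
print("characters stored as surrogate pairs:", extra_units)
```

The second unit of such a pair always falls in DC00..DFFF, which is the range the optional class in the regex above allows after each dot.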
You can refer to the Notepad++ Community post "Regexp fails to match UTF-8 characters":
Unfortunately, in character classes like you mentioned, that means that the characters outside the BMP (at U+10000 and above), while they can be found by
^.+
, cannot be found by something that seems equivalent, like
^[\s\S]+
Problems arise when searching for Unicode characters beyond the Basic Multilingual Plane (BMP), which have a code point between
\x{10000}
and
\x{10FFFF}
(so over
\x{FFFF}
).
For instance, as the code point of the emoticon 🤣 is over
\x{FFFF}
:
- It cannot be represented with its real regex syntax
\x{1F923}
, due to a bug in the present Boost regex engine, which does not handle all characters in true 32-bit encoding, but only with the UTF-16 encoding :-(( So, searching for
\x{1F4A6}
results in the error message
Find: Invalid regular expression
- Moreover, the simple regex dot symbol
(?-s).
cannot match a character with a Unicode code point >
\x{FFFF}
either!
- Of course, if you paste your character directly into the Find what: zone, it does find all occurrences of the
ROLLING ON THE FLOOR LAUGHING
character!
Luckily, the UTF-16 coding of characters in our Boost regex engine allows us to code all characters with a code point over
\x{FFFF}
, thanks to the surrogates mechanism. For generalities, refer to:
https://en.wikipedia.org/wiki/UTF-16
In short, the surrogate pair of a character with a Unicode code point in the range from
\x{10000}
to
\x{10FFFF}
can be described by the regex:
\x{hhhh}\x{iiii}
where D800 ≤ hhhh ≤ DBFF
and DC00 ≤ iiii ≤ DFFF
So if a regex involves the surrogate pair (two 16-bit units) of a character beyond the BMP, our regex engine is able to match it. For instance, as the surrogate pair of the character
ROLLING ON THE FLOOR LAUGHING
is
D83E DD23
, the regex
\x{D83E}\x{DD23}
does find all occurrences of your emoticon character!
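The arithmetic behind a surrogate pair is short enough to sketch. The Python below is an illustration of the standard UTF-16 rule, not something taken from the quoted post, and the function name surrogate_pair is hypothetical:

```python
def surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> DC00..DFFF
    return high, low

high, low = surrogate_pair(0x1F923)      # ROLLING ON THE FLOOR LAUGHING
print(f"{high:04X} {low:04X}")           # D83E DD23
```

For the Musical Note from the question, surrogate_pair(0x1F3B5) gives D83C and DFB5.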
For a full explanation of the two 16-bit code units, called a surrogate pair, refer to:
https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF
To calculate the surrogate pair of a specific character with a code point over
\x{FFFF}
, refer to either of:
http://www.russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm
http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi
On our site, get additional information here:
https://community.notepad-plus-plus.org/post/51068
https://community.notepad-plus-plus.org/post/43037
Recently I also proposed a Notepad++ macro which replaces any selection of the
\xhhhhh
syntaxes with their surrogate pair equivalents
\x{Dhhh}\x{Diii}
! See below:
https://community.notepad-plus-plus.org/post/57528
The summary:
In summary, because of the use of UTF-16, instead of UTF-32, by the present implementation of the Boost Regex library, within N++ :
Use the simple regex
(?-s).
to match any standard character, from
\x{0000}
to
\x{FFFF}
(so not including the EOL chars nor the Form Feed char
\x0c
)
IMPORTANT: From the surrogates mechanism explained above, one may think that the regex
[\x{D800}-\x{DBFF}][\x{DC00}-\x{DFFF}]
should find all the characters with a Unicode code point over
\x{FFFF}
. Unfortunately, this syntax does not work! So, we need to use these derived regexes:
(?-s).[\x{DC00}-\x{DFFF}]
to match any standard character from
\x{10000}
to
\x{10FFFF}
(?-s).[\x{DC00}-\x{DFFF}]?
to match all standard characters, from
\x{0000}
to
\x{10FFFF}
And:
To match a specific character of the BMP, from
\x{0000}
to
\x{FFFF}
, use the regex syntax
\x{hhhh}
, with four hexadecimal digits
To match a specific character beyond the BMP, from
\x{10000}
to
\x{10FFFF}
, use the equivalent high and low surrogate pair, with the regex syntax
\x{<high>}\x{<low>}
, replacing the
<high>
and
<low>
values with their exact hexadecimal values, each written with four hexadecimal digits
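If you would rather not look up surrogate values by hand, a small script can produce the \x{hhhh} form for any pasted text. This is only a sketch under the assumption that writing every UTF-16 code unit as \x{hhhh} is acceptable in the Find what: field, as the rules above suggest; the name npp_regex_escape is invented here and is not part of Notepad++:

```python
def npp_regex_escape(text):
    r"""Write every UTF-16 code unit of text as \x{hhhh} (four hex digits)."""
    data = text.encode("utf-16-be")
    units = [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]
    return "".join("\\x{" + unit + "}" for unit in units)

print(npp_regex_escape("\U0001F923"))  # \x{D83E}\x{DD23}
print(npp_regex_escape("A"))           # \x{0041}
```

BMP characters come out as a single \x{hhhh}, and characters above U+FFFF come out as the high and low surrogate pair, matching the two cases listed above.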