I'm trying to remove every Unicode character in a string if it falls in any of the ranges below.
\uD800-\uDFFF
\u1D800-\u1DFFF
\u2D800-\u2DFFF
\u3D800-\u3DFFF
\u4D800-\u4DFFF
\u5D800-\u5DFFF
\u6D800-\u6DFFF
\u7D800-\u7DFFF
\u8D800-\u8DFFF
\u9D800-\u9DFFF
\uAD800-\uADFFF
\uBD800-\uBDFFF
\uCD800-\uCDFFF
\uDD800-\uDDFFF
\uED800-\uEDFFF
\uFD800-\uFDFFF
\u10D800-\u10DFFF
As an initial prototype, I tried to remove just the characters within the first range by using a regex in the replace function.
var buffer = "he\udfffllo world";
var output = buffer.replace(/[\ud800-\udfff]/g, "");
d.innerText = buffer + " is replaced with " + output;
In this case, the character seems to have been replaced fine.
However, when I replace that with
var buffer = "he\udfffllo worl\u1dfffd";
var output = buffer.replace(/[\ud800-\udfff\u1d800-\u1dfff]/g, "");
d.innerText = buffer + " is replaced with " + output;
I see something unexpected. My output shows up as:
he�llo worl᷿fd is replaced with
There are two things to note here:
\u1dfff does not show up as one character: \u1dff gets converted to a character, and the f at the end is treated as its own character.
Any suggestions on how I can accomplish this would be much appreciated.
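For reference, the behaviour above is consistent with how JavaScript parses \u escapes: without the u flag or the \u{...} form, \u consumes exactly four hex digits, so \u1dfff is read as \u1dff followed by a literal f. The sketch below (variable names are illustrative only) shows this, and shows how the second pattern ends up containing the range 0-\u1dff, which covers nearly every character in the test string.

```javascript
// "\u1dfff" is two UTF-16 code units: U+1DFF followed by the letter "f".
const escaped = "\u1dfff";
const escapedLength = escaped.length;                  // 2
const firstUnit = escaped.charCodeAt(0).toString(16);  // "1dff"
const secondUnit = escaped[1];                         // "f"

// The class below is parsed as: \ud800-\udfff, \u1d80, 0-\u1dff, f.
// The range 0-\u1dff (U+0030 to U+1DFF) includes every ASCII letter and digit.
const buffer = "he\udfffllo worl\u1dfffd";
const output = buffer.replace(/[\ud800-\udfff\u1d800-\u1dfff]/g, "");

// Only the space (U+0020) lies outside all of the ranges, so it alone remains.
console.log(JSON.stringify(output)); // " "
```

This would explain why the text after "is replaced with" looks empty: what is left is a single space.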
EDIT
My overall goal is to filter out all characters that the encodeURIComponent function considers invalid. I ran some tests and found the list above to be the set of characters that are invalid. For instance, the code below, which first converts 1dfff to a Unicode character before passing it to encodeURIComponent, causes the latter function to throw an exception.
var v = String.fromCharCode(122879);
var uriComponent = encodeURIComponent(v);
I edited parts of the question after @Blender pointed out that I was using \x instead of \u in my code to represent Unicode characters.
EDIT 2
I investigated my technique for fetching the "invalid" Unicode ranges further, and as it turns out, if you give String.fromCharCode a number that doesn't fit in 16 bits, it'll just look at the lowest 16 bits of the number. That explains the pattern I was seeing. So as it turns out, I only need to worry about the first range.
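A small sketch of that truncation, for anyone reproducing it. It also uses String.fromCodePoint (ES2015), which is not mentioned above and is included only for contrast; the canEncode helper is a hypothetical name introduced for illustration.

```javascript
// String.fromCharCode keeps only the low 16 bits of each argument,
// so 0x1dfff (122879) is treated as 0xdfff, a lone surrogate.
const truncated = String.fromCharCode(0x1dfff);
const isLoneSurrogate = truncated === "\udfff";   // true

// String.fromCodePoint builds the actual code point U+1DFFF,
// which is stored as a surrogate pair (two code units).
const real = String.fromCodePoint(0x1dfff);
const realLength = real.length;                   // 2

// Hypothetical helper: reports whether encodeURIComponent accepts a string.
function canEncode(s) {
  try {
    encodeURIComponent(s);
    return true;
  } catch (e) {
    if (e instanceof URIError) return false;
    throw e;
  }
}

console.log(canEncode(truncated), canEncode(real)); // false true
```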
It seems you're trying to remove Unicode surrogate code units from the string. However, only U+D800 through U+DFFF are surrogate code points; the remaining values you name are not, and could be allocated to valid Unicode characters. In that case, the following will suffice (use \u rather than \x to refer to Unicode characters):
buffer.replace(/[\ud800-\udfff]/g, "");
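One caveat, offered as a suggestion rather than part of the answer above: that pattern also strips both halves of well-formed surrogate pairs, which encodeURIComponent accepts because they encode characters above U+FFFF. If only unpaired surrogates should be removed, one option, assuming an ES2015+ engine, is to add the u flag so the regex matches by code point and never splits a valid pair. The function name stripLoneSurrogates is hypothetical.

```javascript
// With the u flag the input is matched code point by code point, so a valid
// surrogate pair counts as one code point above U+FFFF and is left alone.
// Only unpaired surrogates fall inside U+D800..U+DFFF and get removed.
function stripLoneSurrogates(s) {
  return s.replace(/[\ud800-\udfff]/gu, "");
}

// "\udfff" is unpaired; "\ud83d\ude00" is a well-formed pair (U+1F600).
const sample = "he\udfffllo \ud83d\ude00 world";
const cleaned = stripLoneSurrogates(sample);

// The cleaned string no longer makes encodeURIComponent throw.
const encoded = encodeURIComponent(cleaned);
console.log(encoded); // hello%20%F0%9F%98%80%20world
```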