javascript unicode uri encodeuricomponent

encodeURIComponent throws an exception


I am programmatically building a URI from user-provided input with the help of the encodeURIComponent function. However, when the user enters invalid Unicode characters (such as U+DFFF), the function throws an exception with the following message:

The URI to be encoded contains an invalid character

I looked this up on MSDN, but that didn't tell me anything I didn't already know.

To correct this error

  • Ensure the string to be encoded contains only valid Unicode sequences.

My question: is there a way to sanitize the user-provided input and remove all invalid Unicode sequences before I pass it on to the encodeURIComponent function?


Solution

  • I took a programmatic approach and tested every UTF-16 code unit on its own. The only range that turned up any problems was \ud800-\udfff, the range for high and low surrogates:

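    // Format a code unit as a zero-padded \uXXXX escape.
    function hex(n) {
        return '\\u' + ('0000' + n.toString(16)).slice(-4);
    }

    // Collect every contiguous range of code units that
    // encodeURIComponent rejects when passed on its own.
    var ranges = [], firstI = null, lastI = null;
    for (var i = 0; i <= 65535; i++) {
        try {
            encodeURIComponent(String.fromCharCode(i));
        }
        catch(e) {
            if (firstI !== null && i === lastI + 1) {
                lastI = i;  // extend the current range
            }
            else {
                if (firstI !== null) {
                    ranges.push([firstI, lastI]);  // close the previous range
                }
                firstI = lastI = i;  // start a new range
            }
        }
    }
    if (firstI !== null) {
        ranges.push([firstI, lastI]);  // close the final range, if any
    }

    // Build a regex character class from the ranges found.
    var regex = '/[';
    for (var j = 0; j < ranges.length; j++) {
        regex += hex(ranges[j][0]);
        if (ranges[j][0] !== ranges[j][1]) {
            regex += '-' + hex(ranges[j][1]);
        }
    }
    regex += ']/';
    alert(regex);  // /[\ud800-\udfff]/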
    

    I then confirmed this with a simpler example:

    for (var i = 0; i <= 65535; i++) {
        if (i >= 0xD800 && i <= 0xDFFF) {
            continue; // skip the surrogate range, but keep testing what follows it
        }
        try {
            encodeURIComponent(String.fromCharCode(i));
        }
        catch(e) {
            alert(e); // Doesn't alert
        }
    }
    alert('ok!');
    

    This fits with what MSDN says: apart from the surrogates, every one of those code units (even the valid Unicode "non-characters") is a valid Unicode sequence on its own.

    You can indeed filter out high and low surrogates, but a high surrogate followed immediately by a low surrogate is a legitimate sequence. Surrogates are meant to be used in such pairs, which is what lets Unicode expand drastically beyond its original maximum number of characters:

    alert(encodeURIComponent('\uD800\uDC00')); // ok
    alert(encodeURIComponent('\uD800')); // not ok
    alert(encodeURIComponent('\uDC00')); // not ok either
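The same behaviour can be checked without letting the exception escape. The following is only a minimal sketch: `tryEncodeURIComponent` is a hypothetical helper name (not part of any library), and it assumes the engine reports the failure as a `URIError`, which is what the ECMAScript specification prescribes for lone surrogates.

```javascript
// Hypothetical helper: returns the encoded string, or null when the input
// contains a lone surrogate and encodeURIComponent throws a URIError.
function tryEncodeURIComponent(str) {
    try {
        return encodeURIComponent(str);
    } catch (e) {
        if (e instanceof URIError) {
            return null;
        }
        throw e; // anything else is unexpected, so rethrow it
    }
}

var pair = tryEncodeURIComponent('\uD800\uDC00'); // '%F0%90%80%80' (U+10000 in UTF-8)
var loneHigh = tryEncodeURIComponent('\uD800');   // null
var loneLow = tryEncodeURIComponent('\uDC00');    // null
```

Returning `null` is just one possible convention; it lets the caller decide whether to reject the input or sanitize it first.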
    

    So, if you want to take the easy route and block surrogates, it is just a matter of:

    urlPart = urlPart.replace(/[\ud800-\udfff]/g, '');
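To make the trade-off of this blunt approach visible, here is a small sketch with made-up sample strings (the variable names are illustrative only). It shows that the replacement makes the input safe to encode, but also removes characters outside the BMP that were encoded as perfectly valid pairs.

```javascript
var blockSurrogates = /[\ud800-\udfff]/g;

var withLone = 'caf\u00e9\udfff';   // ends with a lone low surrogate
var withPair = 'a\ud83d\ude00b';    // 'a', U+1F600 as a valid pair, 'b'

var cleanedLone = withLone.replace(blockSurrogates, '');  // 'caf\u00e9'
var cleanedPair = withPair.replace(blockSurrogates, '');  // 'ab': the valid pair is gone too

var encoded = encodeURIComponent(cleanedLone);            // 'caf%C3%A9'
```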
    

    If you want to strip out unmatched (invalid) surrogates while allowing surrogate pairs (which are legitimate sequences but the characters are rarely ever needed), you can do the following:

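    function stripUnmatchedSurrogates (str) {
        return str
            // 1. Drop high surrogates that are not followed by a low surrogate.
            .replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/g, '')
            // 2. Reverse the string so that "not preceded by" becomes "not followed by",
            //    then drop low surrogates that lack a high surrogate.
            .split('').reverse().join('')
            .replace(/[\uDC00-\uDFFF](?![\uD800-\uDBFF])/g, '')
            // 3. Restore the original order.
            .split('').reverse().join('');
    }
    
    var urlPart = '\uD801 \uD801\uDC00 \uDC01';
    alert(stripUnmatchedSurrogates(urlPart)); // Leaves one valid sequence (representing a single non-BMP character)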
    

    If JavaScript had negative lookbehind the function would be a lot less ugly...
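Engines that implement ES2018 do support lookbehind assertions, so on those the reversal trick is no longer needed. The following is a sketch under that assumption; `stripLoneSurrogates` is a hypothetical name, and the regex will be a syntax error on older engines without lookbehind support.

```javascript
// Sketch assuming ES2018 lookbehind support (hypothetical function name).
function stripLoneSurrogates(str) {
    return str
        // Drop high surrogates that are not followed by a low surrogate.
        .replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/g, '')
        // Drop low surrogates that are not preceded by a high surrogate.
        .replace(/(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g, '');
}

var sample = '\uD801 \uD801\uDC00 \uDC01';
var cleaned = stripLoneSurrogates(sample);    // ' \uD801\uDC00 '
var encodedPart = encodeURIComponent(cleaned); // no longer throws
```

Engines that implement ES2024 additionally provide `String.prototype.isWellFormed()` and `String.prototype.toWellFormed()`. Note that `toWellFormed()` replaces lone surrogates with U+FFFD rather than removing them, so it is not a drop-in equivalent of the function above.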