http url url-encoding rfc

When should an asterisk be encoded in an HTTP URL?


According to RFC1738, an asterisk (*) "may be used unencoded within a URL":

Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.

However, w3.org's Naming and Addressing material says that the asterisk is "reserved for use as having special signifiance within specific schemes" and implies that it should be encoded.

Also, according to RFC3986, a URL is a URI:

The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location").

It also specifies that the asterisk is a "sub-delim", which is part of the "reserved set" and:

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.

It also explicitly specifies that it updates RFC1738.

I read all of this as requiring that asterisks be encoded in a URL unless they are used for a special purpose defined by the URI scheme.

Is RFC1738 the canonical reference for the HTTP URI scheme? Does it somehow exempt the asterisk from encoding, or is it obsolete in that regard due to RFC3986?

Wikipedia says that "[t]he character does not need to be percent-encoded when it has no reserved purpose." Does RFC1738 remove the reserved purpose of the asterisk?

Various resources and tools seem split on this question.

PHP's urlencode and rawurlencode (the latter of which purports to follow RFC3986) do encode the asterisk.

However, JavaScript's escape and encodeURIComponent do not encode the asterisk.

And Java's URLEncoder does not encode the asterisk:

The special characters ".", "-", "*", and "_" remain the same.

Popular online tools (top two results for a Google search for "online url encoder") also do not encode the asterisk. The URL Encode and Decode Tool specifically states that "[t]he reserved characters have to be encoded only under certain circumstances." It goes on to list the asterisk and ampersand as reserved characters. It encodes the ampersand but not the asterisk.
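For comparison, here is a small sketch of how Python's standard library behaves. This is an extra data point that none of the tools above cover: `urllib.parse.quote` encodes the asterisk by default and leaves it alone only when it is passed in the `safe` argument. The sample string `a*b` is made up for illustration.

```python
from urllib.parse import quote, unquote

# By default quote() leaves only letters, digits, "_.-~" and "/" untouched,
# so the asterisk is percent-encoded.
encoded_default = quote("a*b")

# Listing "*" in `safe` tells quote() to leave it as-is.
encoded_safe = quote("a*b", safe="*")

print(encoded_default)  # a%2Ab
print(encoded_safe)     # a*b

# Both forms decode back to the same string.
print(unquote(encoded_default) == unquote(encoded_safe))  # True
```

So Python sides with PHP by default, but makes the choice configurable.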

Other similar questions in the Stack Exchange community seem to have stale, incomplete, or unconvincing answers:

With all this in mind, when should an asterisk be encoded in an HTTP URL?


Solution

  • Short answer

    The current definition of URL syntax and the W3C standards indicate that you never need to percent-encode the asterisk character in the path, query, or fragment components of a URL:


    HTTP 1.1

    RFC 3986 (current URI syntax)

    Asterisk is reserved as a sub-delimiter: a character that has no special meaning in the generic URI syntax, but can be used by implementations to subdivide a component. The path, query and fragment parts of a URI all allow the set of characters defined as pchar, which includes sub-delimiters:

    reserved      = gen-delims / sub-delims
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="
    ...
    pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
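
The grammar above can be checked mechanically. The following is a minimal sketch, not a full URI validator: it transcribes `pchar` into a regular expression and tests single path segments. The `unreserved` set (letters, digits, `-`, `.`, `_`, `~`) is taken from RFC 3986 itself since it is not part of the excerpt, and the helper name `is_valid_segment` is hypothetical.

```python
import re

# Character sets from RFC 3986. sub-delims is copied from the excerpt above;
# unreserved is ALPHA / DIGIT / "-" / "." / "_" / "~" per the same RFC.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"
SEGMENT_RE = re.compile(rf"{PCHAR}*")


def is_valid_segment(segment: str) -> bool:
    """Return True if `segment` consists only of pchar characters."""
    return SEGMENT_RE.fullmatch(segment) is not None


print(is_valid_segment("a*b"))    # True: "*" is a sub-delim, allowed literally
print(is_valid_segment("a%2Ab"))  # True: the percent-encoded form is also fine
print(is_valid_segment("a b"))    # False: a space is not a pchar
```

Both the literal and the encoded asterisk are syntactically valid in a segment, which is why the grammar alone does not force encoding.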
    

    Interestingly, the HTTP spec doesn't define the meaning of any additional delimiters in these URI components, even ones you might associate with HTTP. The use of & as a delimiter in query strings is specified by the W3C's HTML specification:

    4.10.22.6 URL-encoded form data

    Note: This form data set encoding is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices.

    ...

    1. Let strings be the result of strictly splitting the string payload on U+0026 AMPERSAND characters (&).
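
To illustrate the splitting step, here is a simplified sketch of a form-data parser. It assumes a UTF-8 payload and skips the rest of the quoted algorithm; the function name `parse_form_payload` and the sample payload are made up. The point is that only `&`, `=`, `+` and `%` are treated specially, so an asterisk passes through whether or not it was encoded.

```python
from urllib.parse import unquote_plus


def parse_form_payload(payload):
    """Split a URL-encoded payload into (name, value) pairs."""
    pairs = []
    # Step 1 of the quoted algorithm: strictly split on "&".
    for chunk in payload.split("&"):
        if not chunk:
            continue  # simplification: ignore empty chunks
        # The name is everything before the first "=", the value everything after.
        name, _, value = chunk.partition("=")
        # "+" becomes a space and %XX sequences are decoded.
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs


print(parse_form_payload("q=a*b&r=a%2Ab&s=x+y"))
# [('q', 'a*b'), ('r', 'a*b'), ('s', 'x y')]
```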

    RFC 2396 (URI spec before January 2005)

    Obsoleted by RFC 3986 above.

    * is listed as an "unreserved character" in RFC 2396, which is used to define URI syntax in HTTP 1.1. Unreserved characters are allowed in the path component of a URL.

    2.3. Unreserved Characters

    Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper and lower case letters, decimal digits, and a limited set of punctuation marks and symbols.

       unreserved  = alphanum | mark
    
       mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
    

    Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.

    HTTP 1.0

    HTTP 1.0 references RFC 1738 for URL syntax. RFC 1738 has since been updated and obsoleted by later RFCs, so in practice HTTP 1.0 ends up using the same URI syntax RFC as HTTP 1.1.

    As far as backwards compatibility goes, RFC 1738 specifies the asterisk as an unreserved character which can be used unencoded:

    Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.

    This means you're safe to use an asterisk unencoded, even in URLs pointing to the oldest of systems.


    As a side note, the asterisk character does have a special meaning in a Request-URI in both HTTP specs, but it's not possible to represent it with an HTTP URL:

    The asterisk "*" means that the request does not apply to a particular resource, but to the server itself, and is only allowed when the method used does not necessarily apply to a resource. One example would be

       OPTIONS * HTTP/1.1
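
A rough sketch of why no HTTP URL maps to that request line: when a client derives the Request-URI from a URL, it sends the path (or "/" if the path is empty) plus the query, so the result always starts with a slash. The helper `request_target` is hypothetical and `example.com` is a placeholder host.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Derive the path-plus-query Request-URI a client would send for `url`."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path is sent as "/"
    return f"{path}?{parts.query}" if parts.query else path


print(request_target("http://example.com/*"))  # /*
print(request_target("http://example.com"))    # /
```

An asterisk in the URL ends up as "/*", which names a resource; the bare "*" has to be written by the client directly.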
    

    Disclaimer: I'm just reading and interpreting these RFCs myself, so I may be wrong.