java encoding uri percent-encoding

java.net.URI and percent in query parameter value


System.out.println(
    new URI("http", "example.com", "/servlet", "a=x%20y", null));

The result is http://example.com/servlet?a=x%2520y, where the query parameter value differs from the supplied one. Strange, but this does follow the Javadoc:

"The percent character ('%') is always quoted by these constructors."

We can pass the decoded string, a=x y, and then we get a reasonable(?) result: a=x%20y.

But what if the query parameter value contains an "&" character? This happens, for example, if the value is itself a URL with query parameters. Look at this (wrong) query string: a=b&c. The ampersand must be escaped here (a=b%26c), otherwise it can be read as a query parameter a=b followed by some garbage (c). If I pass the escaped form to a URI constructor, it encodes the percent sign again and returns a wrong URL: ...?a=b%2526c

This issue seems to render java.net.URI useless. Am I missing something here?
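To make the behaviour concrete, here is a small sketch that runs the query strings discussed above through the same five-argument constructor. The class and method names (UriQuotingDemo, build) are made up for illustration; only java.net.URI itself comes from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriQuotingDemo {
    // Hypothetical helper: builds a URI from parts with the five-argument
    // constructor (scheme, authority, path, query, fragment) and returns
    // its string form.
    static String build(String query) {
        try {
            return new URI("http", "example.com", "/servlet", query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // '%' is always quoted, so an already-escaped value is escaped twice.
        System.out.println(build("a=x%20y"));  // http://example.com/servlet?a=x%2520y
        // An illegal character such as a space is quoted once.
        System.out.println(build("a=x y"));    // http://example.com/servlet?a=x%20y
        // '&' is a legal query character, so it is left untouched.
        System.out.println(build("a=b&c"));    // http://example.com/servlet?a=b&c
        // There is no input that makes this constructor emit a=b%26c.
        System.out.println(build("a=b%26c"));  // http://example.com/servlet?a=b%2526c
    }
}
```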

Summary of answers

java.net.URI does know that a URI has a query part, but it does not understand the internals of that part, which can differ for each scheme. For example, java.net.URI does not understand the internal structure of the HTTP query part. This would not be a problem if java.net.URI treated the query as an opaque string and did not alter it. But it tries to apply a generic percent-encoding algorithm, which breaks HTTP URLs.

Therefore I cannot use the URI class to reliably assemble a URL from its parts, even though there are constructors for it. I would also mention that, as of Java 7, the implementation of the relativize operation is quite limited: it only works if one URL is a prefix of the other. These two features (and the leaner interface for these purposes) were the reason I was interested in java.net.URI, but neither of them works for me.

In the end I used java.net.URL for parsing, and wrote code to assemble a URL from parts and to relativize two URLs. I also checked the Apache HttpClient URIBuilder class. Although it does understand the internals of an HTTP query string, as of 4.3 it has the same encoding problem as java.net.URI when dealing with the query part as a whole.
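The assembly code itself is not shown here. As a rough sketch of one way such a step could look (not the code actually used), each parameter name and value can be encoded separately and the finished string handed to the single-argument form of java.net.URI, which parses an already-escaped string instead of quoting '%' again. HttpUrlAssembler and assemble are hypothetical names, and the Charset overload of URLEncoder.encode assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class HttpUrlAssembler {
    // Hypothetical helper: appends form-encoded query parameters to a base
    // URL that is assumed to have no query part yet.
    static URI assemble(String base, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            // URLEncoder applies form encoding: '&' becomes %26, space becomes '+'.
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        // URI.create parses the string as-is; existing escapes are preserved.
        return URI.create(base + "?" + query);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("a", "b&c");
        URI uri = assemble("http://example.com/servlet", params);
        System.out.println(uri);               // http://example.com/servlet?a=b%26c
        System.out.println(uri.getRawQuery()); // a=b%26c
        System.out.println(uri.getQuery());    // a=b&c (decoded view)
    }
}
```

Note that URLEncoder produces application/x-www-form-urlencoded output, so a space comes out as "+" rather than "%20"; whether that is acceptable depends on the server that reads the query.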


Solution

  • The query string

    a=b&c
    

    is not wrong in a URI. The RFC on URI Generic Syntax (RFC 2396) states

    The query component is a string of information to be interpreted by the resource.

      query         = *uric
    

    Within a query component, the characters ";", "/", "?", ":", "@",
    "&", "=", "+", ",", and "$" are reserved.

    The character & in the query string is very much valid (uric represents reserved, mark, and alphanumeric characters). The RFC also states

    Many URI include components consisting of or delimited by, certain
    special characters. These characters are called "reserved", since
    their usage within the URI component is limited to their reserved
    purpose. If the data for a URI component would conflict with the
    reserved purpose, then the conflicting data must be escaped before
    forming the URI.

    Because the & is valid but reserved, it is up to the user to determine if it is meant to be encoded or not.

    What you call a query parameter is not a feature of a URI and therefore the URI class has no reason to (and shouldn't) support it.
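A minimal sketch of what "up to the user" can mean in practice: the application, not the URI class, splits the raw query on the unescaped "&" and only then decodes each piece. QueryParams and parse are hypothetical names, the name=value convention is an assumption borrowed from HTML forms, and the Charset overload of URLDecoder.decode assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryParams {
    // Hypothetical helper: interprets the query as '&'-separated name=value
    // pairs. This is an application-level convention, not part of URI syntax.
    static Map<String, String> parse(URI uri) {
        Map<String, String> out = new LinkedHashMap<>();
        String raw = uri.getRawQuery(); // still percent-encoded
        if (raw == null || raw.isEmpty()) {
            return out;
        }
        for (String pair : raw.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // Decode only after splitting, so an escaped %26 stays inside the value.
            out.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                    URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        // Escaped ampersand: one parameter whose value contains '&'.
        System.out.println(parse(URI.create("http://example.com/servlet?a=b%26c"))); // {a=b&c}
        // Unescaped ampersand: a valid URI, read here as two parameters.
        System.out.println(parse(URI.create("http://example.com/servlet?a=b&c")));   // {a=b, c=}
    }
}
```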
