I need to generate an href to a URI. This is easy except when it comes to reserved characters that need percent-encoding, e.g. a link to /some/path;element should appear as <a href="/some/path%3Belement"> (I know that path;element represents a single entity).
Initially I was looking for a Java library that does this but I ended up writing something myself (look below for what failed with Java, as this question isn't Java-specific).
RFC 3986 does say when NOT to encode: as I read it, this is when a character falls in the unreserved class (ALPHA / DIGIT / "-" / "." / "_" / "~"). So far so good. But what about the opposite case? The RFC only mentions that the percent sign (%) always needs encoding. What about the others?
Question: is it correct to assume that everything that is not unreserved can/should be percent-encoded? For example, an opening parenthesis ( does not necessarily need encoding, but a semicolon ; does. If I don't encode it, I end up looking for /first [*] when following <a href="/first;second">. But when following <a href="/first(second"> I always end up looking for /first(second, as expected. What confuses me is that both ( and ; are in the same sub-delims class as far as the RFC goes. As I imagine, encoding everything non-unreserved is a safe bet, but what about SEO and user-friendliness when it comes to localized URIs?
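To make the "encode everything that is not unreserved" option concrete, here is a minimal sketch of such an encoder for a single path segment. The class and method names (StrictPathEncoder, encodeSegment) are made up for illustration, and the sketch assumes the input is raw, not yet encoded text that should be encoded as UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from any library.
final class StrictPathEncoder {

    // RFC 3986 "unreserved" set: ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static boolean isUnreserved(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
            || (b >= '0' && b <= '9')
            || b == '-' || b == '.' || b == '_' || b == '~';
    }

    // Percent-encodes every UTF-8 byte of one path segment that is not unreserved.
    // Assumption: the segment is raw text, so a literal '%' is encoded as well.
    static String encodeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte raw : segment.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            if (isUnreserved(b)) {
                out.append((char) b);
            } else {
                out.append(String.format("%%%02X", b));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("/some/" + encodeSegment("path;element")); // /some/path%3Belement
        System.out.println("/" + encodeSegment("first(second"));      // /first%28second
    }
}
```

The encoder works per segment so that the "/" separators of the path are never touched. The price of this strictness is exactly the concern raised above: every non-ASCII letter in a localized URI turns into a run of %XX escapes.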
Now, what failed with the Java libraries. I tried
new java.net.URI("http", "site", "/pa;th", null).toASCIIString()
but this gives http://site/pa;th
which is no good. Similar results were observed with:
javax.ws.rs.core.UriBuilder
encodePath(String, String)
and encodePathSegment(String, String)
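For reference, the java.net.URI attempt described above can be reproduced with the following small sketch. The class name UriSemicolonDemo and the helper build are made up for illustration; only the java.net.URI calls come from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class reproducing the java.net.URI behaviour.
final class UriSemicolonDemo {

    // Uses the four-argument constructor (scheme, host, path, fragment).
    // It quotes only characters that are illegal in a path; ';' is legal
    // there, so it is passed through unchanged.
    static String build(String path) {
        try {
            return new URI("http", "site", path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("/pa;th")); // http://site/pa;th
        System.out.println(build("/pa th")); // http://site/pa%20th
    }
}
```

The space is escaped because it is not a legal URI character, while the semicolon is kept because the grammar allows it in a path.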
[*] /first is the result of a call to HttpServletRequest.getServletPath() on the server side when clicking on <a href="/first;second">.
EDIT: I should probably mention that this behaviour was observed under Tomcat, and I have checked that Tomcat 6 and 7 both behave the same way.
The ABNF for an absolute path part:
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment = *pchar
segment-nz = 1*pchar
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved = gen-delims / sub-delims
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
           / "*" / "+" / "," / ";" / "="
pchar includes sub-delims, so you would not have to encode any of these in the path part: :@-._~!$&'()*+,;=
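As a rough illustration of that rule, the sketch below encodes a single path segment and leaves every pchar character as it is. The names PcharPathEncoder and encodeSegment are hypothetical, and the code assumes the input is raw, not yet encoded text, so a literal "%" is always escaped.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper following the pchar rule from the ABNF above.
final class PcharPathEncoder {

    // Characters allowed literally in a segment besides ALPHA and DIGIT:
    // the unreserved marks, the sub-delims, ':' and '@'.
    private static final String PCHAR_EXTRA = "-._~!$&'()*+,;=:@";

    private static boolean isPchar(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
            || (b >= '0' && b <= '9')
            || PCHAR_EXTRA.indexOf(b) >= 0;
    }

    // Percent-encodes each UTF-8 byte of the segment that is not a pchar.
    static String encodeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte raw : segment.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            if (isPchar(b)) {
                out.append((char) b);
            } else {
                out.append(String.format("%%%02X", b));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeSegment("first;second")); // first;second
        System.out.println(encodeSegment("a b/c%"));       // a%20b%2Fc%25
    }
}
```

Note that this is valid per the grammar but keeps ";" literal, so it does not avoid the /first result described in the question; a caller who needs the semicolon escaped would have to remove it from the allowed set.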
I wrote my own URL builder which includes an encoder for the path - as always, caveat emptor.