
RFC3986 - which pchars need to be percent-encoded?


I need to generate an href to a URI. This is easy except when it comes to reserved characters that need percent-encoding, e.g. a link to /some/path;element should appear as <a href="/some/path%3Belement"> (I know that path;element represents a single entity).

Initially I was looking for a Java library that does this, but I ended up writing something myself (see below for what failed with Java; this question isn't Java-specific).

RFC 3986 does say when NOT to encode: as I read it, when a character falls in the unreserved class (ALPHA / DIGIT / "-" / "." / "_" / "~"). So far so good. But what about the opposite case? The RFC only mentions that the percent sign (%) always needs encoding. What about the others?

Question: is it correct to assume that everything that is not unreserved can/should be percent-encoded? For example, an opening bracket ( does not necessarily need encoding, but a semicolon ; does. If I don't encode it, I end up looking for /first* when following <a href="/first;second">. But when following <a href="/first(second"> I always end up looking for /first(second, as expected. What confuses me is that both ( and ; are in the same sub-delims class as far as the RFC goes. I imagine encoding everything non-unreserved is a safe bet, but what about SEO and user-friendliness when it comes to localized URIs?
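To make the "encode everything that is not unreserved" option concrete, here is a minimal Java sketch. The class and method names (ConservativeEncoder, encode) are made up for this illustration, and it assumes non-ASCII characters should be encoded as UTF-8 bytes. It is meant for a single path segment, so a / in the input would be encoded as well.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: percent-encodes every byte that is not an
// RFC 3986 "unreserved" character (ALPHA / DIGIT / "-" / "." / "_" / "~").
class ConservativeEncoder {
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    static String encode(String segment) {
        StringBuilder out = new StringBuilder();
        // Assumption: UTF-8 is the encoding the server expects.
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80 && UNRESERVED.indexOf((char) v) >= 0) {
                out.append((char) v);
            } else {
                // Emit %HH with uppercase hex digits.
                out.append('%')
                   .append(Character.toUpperCase(Character.forDigit(v >> 4, 16)))
                   .append(Character.toUpperCase(Character.forDigit(v & 0xF, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("/some/" + encode("path;element")); // /some/path%3Belement
        System.out.println("/" + encode("first(second"));      // /first%28second
    }
}
```

Note that this also encodes ( even though, as observed above, it works unencoded. That is the price of the "safe bet".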

Now, what failed with the Java libraries. I tried
new java.net.URI("http", "site", "/pa;th", null).toASCIIString()
but this gives http://site/pa;th, which is no good. Similar results observed with:

[*] /first is the result of calling HttpServletRequest.getServletPath() on the server side after clicking <a href="/first;second">

EDIT: I should probably mention that this behaviour was observed under Tomcat; I checked Tomcat 6 and 7, and both behave the same way.


Solution

  • The ABNF for an absolute path part:

     path-absolute = "/" [ segment-nz *( "/" segment ) ]
     segment       = *pchar
     segment-nz    = 1*pchar
     pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
     pct-encoded   = "%" HEXDIG HEXDIG
     unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
     reserved      = gen-delims / sub-delims
     sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                   / "*" / "+" / "," / ";" / "="
    

    pchar includes sub-delims, so as far as the RFC is concerned you do not have to encode any of these in the path part: :@-._~!$&'()*+,;=

    I wrote my own URL builder which includes an encoder for the path - as always, caveat emptor.
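For illustration only (this is not the URL builder mentioned above), a path-segment encoder built on the pchar rule could look roughly like the Java sketch below. The names PathSegmentEncoder, encodeSegment and alsoEncode are hypothetical, and UTF-8 is assumed for non-ASCII input. The alsoEncode parameter is there because a server may give a legal pchar its own meaning, as the Tomcat behaviour with ; in the question shows; passing ";" forces it to %3B.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: encodes one path segment, leaving RFC 3986 pchar
// characters literal unless the caller asks for them to be encoded too.
class PathSegmentEncoder {
    // pchar minus pct-encoded: unreserved / sub-delims / ":" / "@"
    private static final String PCHAR_LITERALS =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        + "-._~"          // remaining unreserved characters
        + "!$&'()*+,;="   // sub-delims
        + ":@";

    static String encodeSegment(String segment, String alsoEncode) {
        StringBuilder out = new StringBuilder();
        // Assumption: UTF-8 is the encoding the server expects.
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            boolean literal = c < 0x80
                && PCHAR_LITERALS.indexOf(c) >= 0
                && alsoEncode.indexOf(c) < 0;
            if (literal) {
                out.append(c);
            } else {
                out.append(String.format("%%%02X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("/" + encodeSegment("first;second", ""));  // /first;second
        System.out.println("/" + encodeSegment("first;second", ";")); // /first%3Bsecond
    }
}
```

Segments are encoded one at a time and joined with / afterwards, so a literal slash inside a segment comes out as %2F rather than as a path separator.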