In my Ruby app I need to handle URIs from user input (which are actually IRIs):
str = "http://उदाहरण.परीक्षा/मुख्य_पृष्ठ"
I normalize these using Addressable, and only store the normalized form:
normalized = Addressable::URI.parse(str).normalize
normalized.to_s
#=> http://xn--p1b6ci4b4b3a.xn--11b5bs3a9aj6g/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0
This is nice to work with, but obviously not nice to display to end users.
For that I'd like to convert this URI back to its original form (no Punycode, no percent-encoded path).
Addressable has display_uri, but that only converts the host:
nicer = normalized.display_uri.to_s
#=> http://उदाहरण.परीक्षा/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0
This looks like it works:
display_s = Addressable::URI.parse(str).display_uri.to_s
pretty = Addressable::URI.unencode(display_s.force_encoding("ASCII-8BIT"))
However, that code looks wrong (I should not need to use force_encoding) and I'm not at all confident that it is correct.
What is a good, sane way to convert the entire URI to something usable for end users ("http://उदाहरण.परीक्षा/मुख्य_पृष्ठ")?
Is storing the URIs normalized even a good idea, or does that have consequences I might not be aware of?
code: https://gist.github.com/levinalex/6115764
In short, how do I convert this:
"http://xn--p1b6ci4b4b3a.xn--11b5bs3a9aj6g/" +
"%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4" +
"%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0"
to this:
"http://उदाहरण.परीक्षा/मुख्य_पृष्ठ"
You should not need any forced (re-)encoding to recover the original URI. Simply:
normalised_s = "http://xn--p1b6ci4b4b3a.xn--11b5bs3a9aj6g/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0"
Addressable::URI.unencode(Addressable::URI.parse(normalised_s).display_uri)
=> "http://उदाहरण.परीक्षा/मुख्य_पृष्ठ"
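For intuition about what the unencode step does to the path, here is a stdlib-only sketch. It is not Addressable's implementation, and `display_path` is a hypothetical helper name. Percent-decoding yields raw bytes, so the sketch interprets them as UTF-8 explicitly and falls back to the encoded form if the bytes are not valid UTF-8. It does not touch the Punycode host, which still needs display_uri.

```ruby
# Hypothetical helper (not part of Addressable): percent-decode a URI path
# for display purposes only.
def display_path(path)
  # Work on a binary copy so decoded bytes can be spliced in safely.
  bytes = path.b.gsub(/(?:%\h\h)+/) do |run|
    # "%E0%A4%AE" -> "E0A4AE" -> the three raw bytes E0 A4 AE
    [run.delete("%")].pack("H*")
  end

  # Interpret the decoded bytes as UTF-8, which is what IRIs use.
  decoded = bytes.force_encoding(Encoding::UTF_8)

  # If the escapes did not form valid UTF-8, keep the encoded form
  # rather than showing broken characters.
  decoded.valid_encoding? ? decoded : path
end

encoded = "/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_" \
          "%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0"
puts display_path(encoded)
```

Note that this decodes every escape, including reserved characters such as %2F, so the result is suitable for display only and should not be parsed back into a URI.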
To repeat what Bob said in the comments, storing the normalised form is a good way of guaranteeing uniqueness: equivalent spellings of the same URI end up as a single stored value.