Let's assume a user enters the address of some resource and we need to translate it to:
<a href="valid URI here">human readable form</a>
The HTML4 specification refers to RFC 3986, which allows only ASCII alphanumeric characters and dashes in the host part; all non-ASCII characters in other parts should be percent-encoded. That's what I want to put in the href attribute to make the link work properly in all browsers. An IDN should be encoded with Punycode.
The HTML5 draft refers to RFC 3987, which also allows percent-encoded Unicode characters in the host part and a large subset of Unicode in both the host and other parts without encoding them. The user may enter an address in any of these forms. To provide a human-readable form of it, I need to decode all printable characters. Note that some parts of the address might not correspond to valid UTF-8 sequences, usually when the target site uses some other character encoding.
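The display-side decoding described above could be sketched roughly as follows in Python 3. This is only an illustration under stated assumptions: the name to_display is hypothetical, only escapes for bytes 0x80-0xFF are decoded (so ASCII escapes such as %2F stay as they are), and the host is not converted back from Punycode here. Bytes that do not form valid UTF-8 are left percent-encoded.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# Runs of percent-escapes for bytes 0x80-0xFF. ASCII escapes such as %2F or
# %20 are deliberately not matched, so they stay encoded.
_HIGH_ESCAPES = re.compile(r'(?:%[89A-Fa-f][0-9A-Fa-f])+')


def _render(ch):
    """Return ch as it should appear in the human-readable form."""
    if '\udc80' <= ch <= '\udcff':
        # surrogateescape marks a byte that was not valid UTF-8: re-escape it.
        return '%{:02X}'.format(ord(ch) - 0xDC00)
    if not ch.isprintable():
        # Valid UTF-8 but not printable (e.g. a control character): keep encoded.
        return quote(ch, safe='')
    return ch


def _decode_run(match):
    raw = unquote_to_bytes(match.group(0))
    text = raw.decode('utf-8', errors='surrogateescape')
    return ''.join(_render(ch) for ch in text)


def to_display(uri):
    """Hypothetical helper: decode printable non-ASCII UTF-8 escapes only."""
    return _HIGH_ESCAPES.sub(_decode_run, uri)


print(to_display('http://example.com/%D0%BF%D1%83%D1%82%D1%8C'))
print(to_display('http://example.com/%EF%F3%F2%FC'))  # windows-1251 bytes
```

Note that re-escaped bytes come back with uppercase hex digits, which is equivalent to the lowercase form but not byte-for-byte identical to the input.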
An example of what I'd like to get:
<a href="http://xn--80aswg.xn--p1ai/%D0%BF%D1%83%D1%82%D1%8C?%D0%B7%D0%B0%D0%BF%D1%80%D0%BE%D1%81">
http://сайт.рф/путь?запрос</a>
Are there any tools to solve these tasks? I'm especially interested in libraries for Python and JavaScript.
Update: I know there is a way to do percent and Punycode encoding/decoding in Python and JavaScript (without proper normalization, but I can live with that). The whole task needs much more work, and there are some pitfalls (some characters should always be encoded or never be encoded, depending on context). I wonder if there are ready-to-use libraries for the whole problem, since it seems to be quite common and modern browsers already do such conversions (try typing http://%D1%81%D0%B0%D0%B9%D1%82.%D1%80%D1%84/ in Google Chrome and it will be replaced with http://сайт.рф/, but Chrome will use Host: xn--80aswg.xn--p1ai in the HTTP request).
Update 2: Vinay Sajip pointed out that Werkzeug has iri_to_uri and uri_to_iri functions that handle most cases correctly. I've found only two cases where they fail so far: a percent-encoded host (quite easy to fix) and invalid UTF-8 sequences (a bit tricky to handle nicely, but it shouldn't be a problem).
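For the percent-encoded host case, one possible fix is to decode the host before applying IDNA. The sketch below uses only the Python 3 standard library; host_to_ascii is a hypothetical name, and it assumes the escapes in the host are UTF-8 (anything else raises UnicodeDecodeError). Python's built-in idna codec implements the older IDNA 2003 rules, so results for some edge-case names may differ from what current browsers produce.

```python
from urllib.parse import unquote


def host_to_ascii(host):
    """Hypothetical helper: percent-decode a host, then Punycode-encode it."""
    # errors='strict' makes invalid UTF-8 fail loudly instead of being replaced.
    decoded = unquote(host, encoding='utf-8', errors='strict')
    return decoded.encode('idna').decode('ascii')


print(host_to_ascii('%D1%81%D0%B0%D0%B9%D1%82.%D1%80%D1%84'))
```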
I'm still looking for a JavaScript library. It's not hard to write, but I'd prefer to avoid reinventing the wheel.
If I understand you correctly, then you can use the batteries included in Python:
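# -*- coding: utf-8 -*-
# Python 2 code: urllib and urlparse are the Python 2 module names.
import urllib
import urlparse

URL1 = u'http://сайт.рф/путь?запрос'
URL2 = 'http://%D1%81%D0%B0%D0%B9%D1%82.%D1%80%D1%84/'

def to_idn(url):
    parts = list(urlparse.urlparse(url))
    # Host part: encode with IDNA (Punycode).
    parts[1] = parts[1].encode('idna')
    # Path, params, query and fragment: percent-encode the UTF-8 bytes.
    parts[2:] = [urllib.quote(s.encode('utf-8')) for s in parts[2:]]
    return urlparse.urlunparse(parts)

def from_idn(url):
    # Returns a byte string; decode it as UTF-8 to get unicode.
    return urllib.unquote(url)

print to_idn(URL1)
print from_idn(URL2)
print to_idn(from_idn(URL2).decode('utf-8'))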
which prints
http://xn--80aswg.xn--p1ai/%D0%BF%D1%83%D1%82%D1%8C?%D0%B7%D0%B0%D0%BF%D1%80%D0%BE%D1%81
http://сайт.рф/
http://xn--80aswg.xn--p1ai/
which looks like what you want. I'm not sure what special cases you mean - perhaps you could give some examples of the pitfalls you're referring to?
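As an example of the kind of pitfall involved: quoting every component with the default safe set would also escape "=" and "&" in a query string, and would double-encode escapes that are already present. A rough Python 3 sketch that treats each component separately is shown below. The name to_ascii_uri and the chosen safe sets are assumptions made for illustration, and the sketch assumes there is no userinfo and no IPv6 literal in the authority.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component. '%' is kept so that existing
# escapes are not double-encoded (a stray '%' is therefore not fixed either).
_PATH_SAFE = "/%:@!$&'()*+,;=~"
_QUERY_SAFE = "/?%:@!$&'()*+,;=~"


def to_ascii_uri(iri):
    """Hypothetical helper: convert an IRI to an ASCII-only URI."""
    parts = urlsplit(iri)
    # Assumption: no userinfo and no IPv6 literal, so hostname + port suffice.
    netloc = (parts.hostname or '').encode('idna').decode('ascii')
    if parts.port is not None:
        netloc += ':{}'.format(parts.port)
    path = quote(parts.path, safe=_PATH_SAFE)
    query = quote(parts.query, safe=_QUERY_SAFE)
    fragment = quote(parts.fragment, safe=_QUERY_SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(to_ascii_uri('http://сайт.рф/путь?запрос'))
print(to_ascii_uri('http://example.com:8080/a b?x=1&y=2#frag'))
```

Keeping "%" in the safe set is a trade-off: it preserves input that is already encoded, at the cost of passing a literal "%" through unchanged.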
Update: I just remembered, Werkzeug has iri_to_uri and uri_to_iri functions in versions 0.6 and later (links are to the relevant part of the docs).
Further update: Sorry, I hadn't noticed that you're looking for a JavaScript implementation as well as a Python one. An existing public-domain JavaScript implementation of Punycode is here. I can't vouch for it, though. And of course you can use the built-in JavaScript encodeURI/decodeURI APIs.