javascript python html idn iri

IDN-aware tools to encode/decode a human-readable IRI to/from a valid URI


Let's assume a user enters the address of some resource and we need to translate it to:

<a href="valid URI here">human readable form</a>

The HTML4 specification refers to RFC 3986, which allows only ASCII alphanumeric characters and the dash in the host part and requires all non-ASCII characters in other parts to be percent-encoded. That's what I want to put in the href attribute so that the link works properly in all browsers. An IDN should be encoded with Punycode.

The HTML5 draft refers to RFC 3987, which also allows percent-encoded Unicode characters in the host part, and a large subset of Unicode in both the host and other parts without encoding. The user may enter the address in any of these forms. To provide a human-readable form of it, I need to decode all printable characters. Note that some parts of the address might not correspond to valid UTF-8 sequences, usually when the target site uses some other character encoding.

An example of what I'd like to get:

<a href="http://xn--80aswg.xn--p1ai/%D0%BF%D1%83%D1%82%D1%8C?%D0%B7%D0%B0%D0%BF%D1%80%D0%BE%D1%81">
http://сайт.рф/путь?запрос</a>

Are there any tools to solve these tasks? I'm especially interested in libraries for Python and JavaScript.

Update: I know there are ways to do percent-encoding and Punycode encoding/decoding (without proper normalization, but I can live with that) in Python and JavaScript. The whole task needs much more work, and there are some pitfalls (some characters must always be encoded or never be encoded, depending on context). I wonder if there are ready-to-use libraries for the whole problem, since it seems to be quite common and modern browsers already do such conversions (try typing http://%D1%81%D0%B0%D0%B9%D1%82.%D1%80%D1%84/ in Google Chrome: it will be replaced with http://сайт.рф/, but the HTTP request will use Host: xn--80aswg.xn--p1ai).
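To make the browser behaviour described above concrete, here is a minimal Python 3 sketch that turns a percent-encoded or Unicode host into its Punycode form. The name host_to_ascii is hypothetical, the escapes are assumed to be UTF-8, and the built-in 'idna' codec implements the older IDNA 2003 rules, so treat this as an illustration, not a complete normalizer.

```python
from urllib.parse import unquote


def host_to_ascii(host):
    """Return the ASCII (Punycode) form of a bare host name.

    Assumes `host` has no port or userinfo and that any
    percent-escapes in it are UTF-8 encoded.
    """
    # Undo percent-encoding first; raises UnicodeDecodeError on invalid UTF-8.
    decoded = unquote(host, encoding='utf-8', errors='strict')
    # The built-in 'idna' codec applies IDNA 2003 (nameprep + Punycode) per label.
    return decoded.encode('idna').decode('ascii')


print(host_to_ascii('%D1%81%D0%B0%D0%B9%D1%82.%D1%80%D1%84'))  # xn--80aswg.xn--p1ai
print(host_to_ascii('сайт.рф'))                                # xn--80aswg.xn--p1ai
```

Both inputs give the value that Chrome sends in the Host header in the example above.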

Update2: Vinay Sajip pointed out that Werkzeug has iri_to_uri and uri_to_iri functions that handle most cases correctly. So far I've found only 2 cases where they fail: a percent-encoded host (quite easy to fix) and invalid UTF-8 sequences (a bit tricky to handle nicely, but it shouldn't be a problem).

I'm still looking for a JavaScript library. It's not hard to write, but I'd prefer to avoid reinventing the wheel.
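For the invalid UTF-8 case, one possible approach (a sketch in Python 3, not what Werkzeug does) is to decode each run of percent-escapes with the surrogateescape error handler and re-escape every byte that did not decode. The function name unquote_for_display and the set of characters kept escaped are assumptions made for this illustration.

```python
import re
from urllib.parse import unquote_to_bytes

_ESCAPES = re.compile(r'(?:%[0-9A-Fa-f]{2})+')
# Assumed set of characters that stay escaped so the decoded text
# still splits into the same URL components.
_KEEP_ESCAPED = set('%/?#&=+;@: ')


def _decode_run(match):
    raw = unquote_to_bytes(match.group(0))
    # Undecodable bytes become lone surrogates U+DC80..U+DCFF instead of raising.
    text = raw.decode('utf-8', errors='surrogateescape')
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Byte that was not valid UTF-8: put the escape back.
            out.append('%%%02X' % (code - 0xDC00))
        elif ch in _KEEP_ESCAPED or not ch.isprintable():
            # Reserved or non-printable character: keep it percent-encoded.
            out.append(''.join('%%%02X' % b for b in ch.encode('utf-8')))
        else:
            out.append(ch)
    return ''.join(out)


def unquote_for_display(part):
    """Decode printable UTF-8 escapes in one URL component, leave the rest."""
    return _ESCAPES.sub(_decode_run, part)


print(unquote_for_display('/%D0%BF%D1%83%D1%82%D1%8C'))  # /путь
print(unquote_for_display('/%EF%F3%F2%FC'))              # /%EF%F3%F2%FC (cp1251 bytes)
```

The second call shows a path encoded in Windows-1251: since the bytes are not valid UTF-8, they are left percent-encoded and the link still points to the same resource.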


Solution

  • If I understand you correctly, then you can use the batteries included in Python:

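    # Python 3 version; assumes the netloc is a bare host name
    # (no port or userinfo) and that escapes are UTF-8.
    from urllib.parse import quote, unquote, urlparse, urlunparse
    
    URL1 = 'http://сайт.рф/путь?запрос'
    URL2 = 'http://%D1%81%D0%B0%D0%B9%D1%82.%D1%80%D1%84/'
    
    def to_idn(url):
        parts = list(urlparse(url))
        # Host: Punycode via the built-in 'idna' codec.
        parts[1] = parts[1].encode('idna').decode('ascii')
        # Other components: percent-encode as UTF-8 (quote's default).
        parts[2:] = [quote(s) for s in parts[2:]]
        return urlunparse(parts)
    
    def from_idn(url):
        # Percent-decoding only; Punycode hosts are left as they are.
        return unquote(url)
    
    print(to_idn(URL1))
    print(from_idn(URL2))
    print(to_idn(from_idn(URL2)))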
    

    which prints

    http://xn--80aswg.xn--p1ai/%D0%BF%D1%83%D1%82%D1%8C?%D0%B7%D0%B0%D0%BF%D1%80%D0%BE%D1%81
    http://сайт.рф/
    http://xn--80aswg.xn--p1ai/
    

    which looks like what you want. I'm not sure what special cases you mean - perhaps you could give some examples of the pitfalls you're referring to?

    Update: I just remembered, Werkzeug has iri_to_uri and uri_to_iri functions in versions 0.6 and later.

    Further update: Sorry, I hadn't noticed that you're looking for a JavaScript implementation as well as a Python one. There is an existing public domain JavaScript implementation of Punycode; I can't vouch for it, though. And of course you can use the built-in JavaScript encodeURI/decodeURI APIs.
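As an illustration of the context-dependent pitfalls raised in the question: quoting a whole query string with the default settings also escapes the = and & delimiters. Below is a minimal Python 3 sketch that uses a different safe set per component. The name iri_to_uri_sketch and the exact safe sets are assumptions; userinfo is dropped and existing percent-escapes are passed through unchanged, so this is not a substitute for a full RFC 3987 implementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Assumed safe sets: sub-delimiters plus ':' and '@', with '%' kept
# so that already-escaped input is not double-encoded.
_PATH_SAFE = "/%:@!$&'()*+,;="
_QUERY_SAFE = "/?%:@!$&'()*+,;="


def iri_to_uri_sketch(iri):
    """Convert an IRI to an ASCII-only URI (illustrative only)."""
    parts = urlsplit(iri)
    # Host: Punycode via the built-in 'idna' codec (IDNA 2003).
    host = (parts.hostname or '').encode('idna').decode('ascii')
    if parts.port is not None:
        host = '%s:%d' % (host, parts.port)
    path = quote(parts.path, safe=_PATH_SAFE)
    query = quote(parts.query, safe=_QUERY_SAFE)
    fragment = quote(parts.fragment, safe=_QUERY_SAFE)
    return urlunsplit((parts.scheme, host, path, query, fragment))


# Default quoting breaks the query structure:
print(quote('a=1&b=2'))  # a%3D1%26b%3D2
# Component-aware quoting keeps the delimiters:
print(iri_to_uri_sketch('http://сайт.рф/путь?запрос=да&x=1'))
```

The last call keeps = and & literal in the query while still percent-encoding the Cyrillic text and converting the host to Punycode.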