
Characters allowed in a URL


Does anyone know the full list of characters that can be used in a GET request without being encoded? At the moment I am using A-Z, a-z and 0-9, but I would like to know the full list.

I am also interested in whether a specification has been released for the upcoming addition of Chinese and Arabic URLs (as that will obviously have a big impact on my question).


Solution

  • EDIT: As @Jukka K. Korpela correctly points out, RFC 1738 was updated by RFC 3986, which expanded and clarified the characters valid in the host component. The grammar is not easy to copy and paste, but I'll do my best.

    The alternatives are tried in order and the first match wins:

    host        = IP-literal / IPv4address / reg-name
    
    IP-literal  = "[" ( IPv6address / IPvFuture  ) "]"
    
    IPvFuture   = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
    
    IPv6address =         6( h16 ":" ) ls32
                      /                       "::" 5( h16 ":" ) ls32
                      / [               h16 ] "::" 4( h16 ":" ) ls32
                      / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                      / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                      / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                      / [ *4( h16 ":" ) h16 ] "::"              ls32
                      / [ *5( h16 ":" ) h16 ] "::"              h16
                      / [ *6( h16 ":" ) h16 ] "::"
    
    ls32        = ( h16 ":" h16 ) / IPv4address
                      ; least-significant 32 bits of address
    
    h16         = 1*4HEXDIG 
                   ; 16 bits of address represented in hexadecimal
    
    IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
    
    dec-octet   = DIGIT                 ; 0-9
                  / %x31-39 DIGIT         ; 10-99
                  / "1" 2DIGIT            ; 100-199
                  / "2" %x30-34 DIGIT     ; 200-249
                  / "25" %x30-35          ; 250-255
    
    reg-name    = *( unreserved / pct-encoded / sub-delims )
    
    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
                  ; a practical shortcut: this set most closely resembles the original answer
    
    reserved    = gen-delims / sub-delims
    
    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="
    
    pct-encoded = "%" HEXDIG HEXDIG
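
    To make the character classes above concrete, here is a minimal Python sketch. The helper names `classify` and `encode_component` are hypothetical (they are not part of the RFC or of any library); the sets are transcribed from the grammar above, and the encoding relies on the standard library's `urllib.parse.quote`, which on Python 3.7 and later leaves only the unreserved characters untouched when `safe=""` is passed.

```python
import string
from urllib.parse import quote

# Character classes transcribed from the RFC 3986 grammar quoted above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")


def classify(ch: str) -> str:
    """Return the RFC 3986 class of a single character (hypothetical helper)."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    # Anything else (spaces, quotes, non-ASCII, ...) has to be percent-encoded.
    return "must-encode"


def encode_component(value: str) -> str:
    """Percent-encode everything except the unreserved set (hypothetical helper)."""
    # safe="" overrides quote()'s default of leaving "/" unescaped.
    return quote(value, safe="")


if __name__ == "__main__":
    for ch in "a~&: ":
        print(repr(ch), classify(ch))
    print(encode_component("a b&c=d/e~f"))
```

    Encoding everything outside the unreserved set is the conservative choice for a single query value: reserved characters are legal in a URL, but only when they are used for their reserved purpose.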
    

    Original answer, from the RFC 1738 specification:

    Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.

    ^ Obsolete since 1998, when RFC 2396 updated RFC 1738.
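
    As a rough illustration of what changed, the sketch below compares the set quoted from RFC 1738 with the `unreserved` set of RFC 3986. The variable names are made up for this example, and the sets are copied from the text above rather than taken from any library.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

# Characters RFC 1738 allowed unencoded, as quoted above.
RFC1738_UNENCODED = ALNUM | set("$-_.+!*'(),")
# The "unreserved" rule from RFC 3986, as quoted above.
RFC3986_UNRESERVED = ALNUM | set("-._~")
# The "sub-delims" rule from RFC 3986, as quoted above.
RFC3986_SUB_DELIMS = set("!$&'()*+,;=")

# Characters that appear in only one of the two sets.
only_1738 = "".join(sorted(RFC1738_UNENCODED - RFC3986_UNRESERVED))
only_3986 = "".join(sorted(RFC3986_UNRESERVED - RFC1738_UNENCODED))

print("only in the RFC 1738 set:  ", only_1738)
print("only in RFC 3986 unreserved:", only_3986)
# Every character dropped from the "always safe" set is now a sub-delim.
print(set(only_1738) <= RFC3986_SUB_DELIMS)
```

    In other words, the punctuation that RFC 1738 treated as safe is reserved in RFC 3986 (it may still appear unencoded where the grammar allows sub-delims), while "~" is unreserved in RFC 3986 but absent from the RFC 1738 list.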