
What technique does Wikipedia use to convert non-latin characters to be ID safe?


I am trying to auto-generate IDs for headings, to be used as anchor links. Because we don't know what language the heading is in, it should work for any language and not create illegal characters.

So I was looking at Wikipedia to see how it does that job, but can't figure out how exactly they are doing it. For example, the heading Ссылки is transformed to .D0.A1.D1.81.D1.8B.D0.BB.D0.BA.D0.B8, and used as the ID of the heading DOM.

Does anyone have any insight as to how that's done?


Solution

  • Okay, I think I've figured it out. Wikipedia represents the heading as the hex code of each of its UTF-8 bytes, separated by dots. Forcing the string into a single-byte encoding such as Latin-1 is a trick that makes Ruby's `inspect` render each non-ASCII byte as a literal `\xNN` escape, which can then be reformatted. Following is some Ruby code I wrote to demonstrate the process:

    # For arbitrary input `text`, force the encoding to Latin-1 so that
    # Ruby treats each UTF-8 byte as a separate single-byte character
    encoded_text = text.force_encoding('iso-8859-1')
    
    # Extract the string from `inspect`, which renders each non-ASCII byte
    # as a literal "\xNN" hex escape; strip the surrounding quotes
    plaintext_encoded_text = /\A"(.*)"\z/.match(encoded_text.inspect)[1]
    
    # Replace the "\x" escape prefix with "." and spaces with "-"
    output = plaintext_encoded_text.gsub('\x', '.').gsub(/\s/, '-')
    

    This process converts Ссылки to .D0.A1.D1.81.D1.8B.D0.BB.D0.BA.D0.B8, which matches what appears on Wikipedia. ASCII (Latin) characters pass through unchanged.
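    The dotted hex codes are also exactly what URL percent-encoding of the UTF-8 bytes produces, with `.` in place of `%`. So the same output can be had more directly with Ruby's standard `CGI.escape` — a minimal sketch (the `heading_id` name is mine, and the `+`-to-`-` step mirrors the space handling above, since `CGI.escape` encodes a space as `+`):

    ```ruby
    require 'cgi'

    # Percent-encode the UTF-8 bytes, then swap the separators:
    # "+" (an escaped space) becomes "-", and "%" becomes "."
    def heading_id(text)
      CGI.escape(text).gsub('+', '-').gsub('%', '.')
    end

    heading_id('Ссылки')  # => ".D0.A1.D1.81.D1.8B.D0.BB.D0.BA.D0.B8"
    ```

    Note that `CGI.escape` also escapes some ASCII punctuation (parentheses, commas, etc.), so headings containing those will get dotted hex codes for them too.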