
Replace non-ASCII characters with SGML entity codes with Emacs


I have an HTML file with a few non-ASCII characters, say encoded in UTF-8 or UTF-16. To save the file in ASCII, I would like to replace them with their (SGML/HTML/XML) entity codes. So for example, every ë should become &euml; and every ◊ should become &#9674;. How do I do that?

I use Emacs as an editor. I'm sure it has a function to do the replace, but I cannot find it. What am I missing? Or how do I implement it myself?


Solution

  • The character class [:ascii:] matches exactly the ASCII character set. You can use a regexp that matches its complement to find occurrences of non-ASCII characters, and then replace them with their numeric codes using elisp:

    M-x replace-regexp RET
    [^[:ascii:]] RET
    \,(concat "&#" (number-to-string (string-to-char \&)) ";") RET
    

    So when, for example, á is matched: \& is "á", string-to-char converts it to ?á (= the number 225), and number-to-string converts that to "225". Then concat joins "&#", "225" and ";" to get "&#225;", which replaces the original match.

    Surround these commands with C-x ( and C-x ), and apply C-x C-k n and M-x insert-kbd-macro as usual to make a function out of them.


    To see the elisp equivalent of the interactive replace-regexp call, run the command and then press C-x M-: (repeat-complex-command).

    A simpler version, which doesn't take into account the active region, could be:

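    ;; Replace every non-ASCII character after point with its decimal
    ;; character reference, e.g. á becomes &#225;.
    (while (re-search-forward "[^[:ascii:]]" nil t)
      (replace-match (concat "&#"
                             (number-to-string (string-to-char (match-string 0)))
                             ";")
                     t t))   ; FIXEDCASE and LITERAL: insert the text verbatim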
    

    (This uses the recommended way to do search + replace programmatically.)
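
If you want the loop to respect the active region, one possible sketch is to wrap it in a command. The name my-non-ascii-to-entities is hypothetical (it is not an existing Emacs function), and the sketch assumes a normal multibyte buffer. It only emits decimal references such as &#235;, not named entities such as &euml;.

```elisp
;; Hypothetical helper, not part of Emacs: the name is made up for illustration.
(defun my-non-ascii-to-entities (start end)
  "Replace non-ASCII characters between START and END with &#N; references.
Interactively, act on the active region, or on the whole buffer if
no region is active."
  (interactive
   (if (use-region-p)
       (list (region-beginning) (region-end))
     (list (point-min) (point-max))))
  (save-excursion
    ;; Each replacement lengthens the text, so track END with a marker
    ;; that moves along with the buffer contents.
    (let ((end-marker (copy-marker end)))
      (goto-char start)
      (while (re-search-forward "[^[:ascii:]]" end-marker t)
        ;; `string-to-char' yields the character code of the match.
        (replace-match (format "&#%d;" (string-to-char (match-string 0)))
                       t t))
      ;; Detach the marker so it no longer tracks the buffer.
      (set-marker end-marker nil))))
```

After evaluating the definition, M-x my-non-ascii-to-entities converts the region if one is active and the whole buffer otherwise.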