haskell, utf-8, overloaded-strings

UTF-8 and overloaded strings in Haskell


I realized that accents in my texts get converted to �. I boiled it down to the following example, which writes (and overwrites) the file test.txt.

It uses only functions from Data.Text and Data.Text.IO, which are supposed to handle Unicode text. I checked that both the source file and the output file are encoded in UTF-8.

{-# LANGUAGE OverloadedStrings #-}

import Prelude hiding (writeFile)
import Data.Text
import Data.Text.IO

someText :: Text
someText = "Université"

main :: IO ()
main = do 
    writeFile "test.txt" someText

After running the code, test.txt contains: Universit�. In GHCi, I get the following:

*Main> someText
"Universit\233"

Is this already encoded incorrectly? I also found a comment on � in https://hackage.haskell.org/package/text-1.2.2.2/docs/Data-Text.html, but I still do not know how to correct the example above.

How do I use accented characters in an overloaded string literal and correctly write them to a file?


Solution

  • This has nothing to do with Data.Text, and certainly not with OverloadedStrings: both handle Unicode just fine.

    However, Data.Text.IO will not write a BOM or anything else that indicates the encoding, i.e. the file really just contains the encoded text as-is. On a system with a UTF-8 locale, this means it will be in raw UTF-8 form:

    sagemuej@sagemuej-X302LA:~$ xxd test.txt 
    00000000: 556e 6976 6572 7369 74c3 a9              Universit..
    sagemuej@sagemuej-X302LA:~$ cat test.txt 
    Université
    

    So depending on what editor you open the file with, it may guess a wrong encoding, and that's apparently your issue. On Linux, UTF-8 has long been the standard, so no issue here, but Windows isn't so up-to-date. It should be possible to manually select the encoding in any editor, though.

    In fact, Data.Text.IO.writeFile uses your locale to decide how to encode the file. Everybody should have UTF-8 as their locale nowadays; if you don't, please change that.
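
If changing the locale is not an option, the encoding can also be fixed in the program itself. The following is a minimal sketch, not part of the original answer: the helper names writeFileUtf8 and writeFileUtf8Bytes are hypothetical, and it assumes only the text and bytestring packages that ship with GHC.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString    as BS
import           Data.Text          (Text)
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO       as TIO
import           System.IO          (IOMode (WriteMode), hSetEncoding, utf8, withFile)

-- | Hypothetical helper: write Text as UTF-8 regardless of the locale.
writeFileUtf8 :: FilePath -> Text -> IO ()
writeFileUtf8 path txt =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8   -- override the locale-derived default encoding
    TIO.hPutStr h txt

-- | Hypothetical alternative: encode to bytes explicitly and write those.
writeFileUtf8Bytes :: FilePath -> Text -> IO ()
writeFileUtf8Bytes path = BS.writeFile path . TE.encodeUtf8

main :: IO ()
main = do
  writeFileUtf8 "test.txt" "Université"
  bytes <- BS.readFile "test.txt"
  -- Raw bytes of the file; 'é' is stored as the two bytes 195 169 (c3 a9).
  print (BS.unpack bytes)
  -- prints [85,110,105,118,101,114,115,105,116,195,169]
```

Both helpers should produce the same bytes as the xxd dump above; the second one bypasses handle encodings entirely, which may be preferable when the file is always meant to be UTF-8.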

    To get a BOM in your file and thus preclude such issues, set the handle's encoding to utf8_bom (exported by System.IO) with hSetEncoding before writing.
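
As a rough illustration of the BOM variant, here is a sketch that writes the same text with utf8_bom and then inspects the first three bytes of the file. The helper name writeFileUtf8Bom is hypothetical; hSetEncoding and utf8_bom come from System.IO.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as BS
import           Data.Text       (Text)
import qualified Data.Text.IO    as TIO
import           System.IO       (IOMode (WriteMode), hSetEncoding, utf8_bom, withFile)

-- | Hypothetical helper: write Text as UTF-8 with a leading byte-order mark.
writeFileUtf8Bom :: FilePath -> Text -> IO ()
writeFileUtf8Bom path txt =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8_bom   -- like utf8, but prepends EF BB BF on output
    TIO.hPutStr h txt

main :: IO ()
main = do
  writeFileUtf8Bom "test.txt" "Université"
  bytes <- BS.readFile "test.txt"
  -- The first three bytes should be the UTF-8 BOM: 239 187 191 (ef bb bf).
  print (BS.unpack (BS.take 3 bytes))
```

Editors that honour the BOM should then pick UTF-8 without guessing. Some Unix tools treat the BOM as ordinary content, so this is a trade-off rather than a universal fix.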

    Regarding the output you see in GHCi: that's the Show instance at work; it escapes any string-like value to the safest conceivable form, i.e. every non-ASCII character becomes an escape sequence, which for 'é' is '\233' (its Unicode code point in decimal, U+00E9). Again, this is not specific to Text; in fact, you get it even for single characters:

    Prelude> 'é'
    '\233'
    Prelude> putChar '\233'
    é
    

    This escaping never happens when you use the direct-IO-output actions for your string types, i.e. putChar, putStr or putStrLn.

    Prelude> import qualified Data.Text.IO as Txt
    Prelude Txt> Txt.putStrLn "Université"
    Université