I realized that accents in my texts get converted to �. I boiled it down to the following example, which writes (and overwrites) the file test.txt.
It uses exclusively functions from Data.Text, which are supposed to handle Unicode text. I checked that both the source file and the output file are encoded in UTF-8.
{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (writeFile)
import Data.Text
import Data.Text.IO
someText :: Text
someText = "Université"
main :: IO ()
main = do
  writeFile "test.txt" someText
After running the code, test.txt contains: Universit�. In GHCi, I get the following:
*Main> someText
"Universit\233"
Is this already encoded incorrectly? I also found a comment on � in https://hackage.haskell.org/package/text-1.2.2.2/docs/Data-Text.html, but I still do not know how to correct the example above.
How do I use accents in an OverloadedStrings literal and correctly write them to a file?
This has nothing to do with Data.Text, and certainly not with OverloadedStrings; both handle Unicode just fine.
However, Data.Text.IO will not write a BOM or anything else that indicates the encoding, i.e. the file really just contains the text as-is. On any modern system, this means it will be in raw UTF-8 form:
sagemuej@sagemuej-X302LA:~$ xxd test.txt
00000000: 556e 6976 6572 7369 74c3 a9 Universit..
sagemuej@sagemuej-X302LA:~$ cat test.txt
Université
So depending on what editor you open the file with, it may guess a wrong encoding, and that's apparently your issue. On Linux, UTF-8 has long been the standard, so no issue here, but Windows isn't so up-to-date. It should be possible to manually select the encoding in any editor, though.
In fact, Data.Text.IO.writeFile will use your locale to decide how to encode the file. Everybody should have UTF-8 as their locale nowadays; if you don't, please change that.
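If changing the system locale is not an option, one possible workaround is to override the default from inside the program. The following is only a sketch: it assumes GHC's base library, whose GHC.IO.Encoding module provides getLocaleEncoding and setLocaleEncoding, and it affects handles opened after the call.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.IO as T
import           GHC.IO.Encoding (getLocaleEncoding, setLocaleEncoding, utf8)

main :: IO ()
main = do
  -- Report what the runtime picked up from the environment.
  enc <- getLocaleEncoding
  putStrLn ("Locale encoding: " ++ show enc)
  -- Make UTF-8 the default for handles opened from here on.
  setLocaleEncoding utf8
  T.writeFile "test.txt" "Université"
```

This changes the default for the whole process, so it is usually done once at the very start of main.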
To get a BOM in your file and thus preclude such issues, use the utf8_bom encoding from System.IO (set on the file handle with hSetEncoding).
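A minimal sketch of how that could look, choosing the encoding per file instead of relying on the locale. The helper writeFileWith and the file name test-bom.txt are made up for this illustration; withFile, hSetEncoding, utf8 and utf8_bom come from System.IO.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Text (Text)
import qualified Data.Text.IO as T
import           System.IO

-- | Hypothetical helper: write a Text value with an explicitly
-- chosen encoding rather than the locale's default.
writeFileWith :: TextEncoding -> FilePath -> Text -> IO ()
writeFileWith enc path txt =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc   -- must happen before anything is written
    T.hPutStr h txt

main :: IO ()
main = do
  writeFileWith utf8     "test.txt"     "Université"  -- plain UTF-8, no BOM
  writeFileWith utf8_bom "test-bom.txt" "Université"  -- UTF-8 with a leading BOM
```

Running xxd on the second file should show the three BOM bytes ef bb bf in front of the text.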
Regarding the output you see in GHCi: that's the Show instance at work; it escapes any string-like values to the safest conceivable form, i.e. anything that's not ASCII to an escape sequence, which for 'é' happens to be '\233'. Again, this is not specific to Text; in fact you get this even for single characters:
Prelude> 'é'
'\233'
Prelude> putChar '\233'
é
This escaping never happens when you use the direct IO output actions for your string types, i.e. putChar, putStr or putStrLn.
Prelude> :set -XOverloadedStrings
Prelude> import qualified Data.Text.IO as Txt
Prelude Txt> Txt.putStrLn "Université"
Université
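The same distinction shows up in a compiled program. As a small sketch reusing someText from the question: print goes through Show and escapes the accent, while Data.Text.IO.putStrLn writes the characters themselves, encoded according to the handle's encoding.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T

someText :: Text
someText = "Université"

main :: IO ()
main = do
  print someText             -- via Show: "Universit\233"
  T.putStrLn someText        -- the text itself, no escaping
  print (T.length someText)  -- 10: 'é' is a single character in the Text
```

So the "\233" in GHCi is only how the value is displayed; the Text itself holds the right character.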