haskell unicode aeson

Aeson does not decode strings with unicode characters


I'm trying to use Data.Aeson (https://hackage.haskell.org/package/aeson-0.6.1.0/docs/Data-Aeson.html) to decode some JSON strings, but it fails to parse strings that contain non-ASCII characters.

As an example, the file:

import Data.Aeson
import Data.ByteString.Lazy.Char8 (pack)

test1 :: Maybe Value
test1 = decode $ pack "{ \"foo\": \"bar\"}"

test2 :: Maybe Value
test2 = decode $ pack "{ \"foo\": \"bòz\"}"

When run in ghci, gives the following results:

*Main> :l ~/test.hs
[1 of 1] Compiling Main             ( /Users/ltomlin/test.hs, interpreted )
Ok, modules loaded: Main.
*Main> test1
Just (Object fromList [("foo",String "bar")])
*Main> test2
Nothing

Is there a reason that it doesn't parse the String with the unicode character? I was under the impression that Haskell was pretty good with unicode. Any suggestions would be greatly appreciated!

Thanks,

tetigi

EDIT

Upon further investigation using eitherDecode, I get the following error message:

 *Main> test2
 Left "Failed reading: Cannot decode byte '\\x61': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream"

x61 is an ordinary ASCII byte (the code for 'a'), not the special unicode character itself. Not sure why it's failing on the characters around the special character!

Changing test2 to be test2 = decode $ pack "{ \"foo\": \"bòz\"}" instead gives the error:

Left "Failed reading: Cannot decode byte '\\xf2': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream"

xf2 is the code point of "ò" (U+00F2), which makes a bit more sense.
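
The eitherDecode version of the test file is not shown in the question. A minimal sketch of what it might look like follows; the names test1' and test2' are hypothetical, and the escape \242 stands in for "ò" so the source file's own encoding does not matter. The exact error text depends on the aeson and text versions installed, so none is claimed here.

```haskell
import Data.Aeson (Value, eitherDecode)
import Data.ByteString.Lazy.Char8 (pack)

-- Same inputs as the original tests, but eitherDecode keeps the
-- parser's error message instead of collapsing it to Nothing.
-- (test1' and test2' are illustrative names, not from the question.)
test1' :: Either String Value
test1' = eitherDecode $ pack "{ \"foo\": \"bar\"}"

-- '\242' is the Char U+00F2 ("ò"); Char8's pack turns it into the
-- single byte 0xF2.
test2' :: Either String Value
test2' = eitherDecode $ pack "{ \"foo\": \"b\242z\"}"
```

Evaluating test2' in ghci then shows a Left with the decoder's message rather than a bare Nothing.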


Solution

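  • The problem is your use of pack from the Char8 module, which truncates every Char to a single byte. That yields Latin-1 bytes (ò becomes the lone byte 0xF2), whereas aeson expects UTF-8 input. Instead, build a Text and encode it with encodeUtf8 from the text package.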

    You can write your examples like this:

    import Data.Aeson
    import Data.Text.Lazy (pack)
    import Data.Text.Lazy.Encoding (encodeUtf8)
    
    test1 :: Maybe Value
    test1 = decode $ encodeUtf8 $ pack "{ \"foo\": \"bar\"}"
    
    test2 :: Maybe Value
    test2 = decode $ encodeUtf8 $ pack "{ \"foo\": \"bòz\"}"
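
To see why the two approaches differ, it can help to look at the raw bytes each one produces. The sketch below uses only the bytestring and text packages; the helper names char8Bytes, utf8Bytes and isValidUtf8 are made up for illustration and are not part of any library.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE
import Data.Word (Word8)

-- Bytes produced by Char8's pack: each Char is truncated to 8 bits.
char8Bytes :: String -> [Word8]
char8Bytes = BL.unpack . BLC.pack

-- Bytes produced by going through Text and encoding as UTF-8.
utf8Bytes :: String -> [Word8]
utf8Bytes = BL.unpack . TLE.encodeUtf8 . TL.pack

-- True when the byte string can be decoded as UTF-8.
isValidUtf8 :: BL.ByteString -> Bool
isValidUtf8 = either (const False) (const True) . TLE.decodeUtf8'
```

In ghci, char8Bytes "\242" gives [242] (the single byte 0xF2), while utf8Bytes "\242" gives [195,178] (0xC3 0xB2, the UTF-8 encoding of "ò"). Accordingly, isValidUtf8 (BLC.pack "b\242z") is False, which matches the "Invalid UTF-8 stream" error in the question.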