haskell hxt

HXT: Handling byte order mark from HTTP response body


Using HXT I'm parsing the XML response body of an HTTP call that was made using http-conduit.

val <- runX $ readString [withValidate no] (Data.ByteString.UTF8.toString . toStrict $ getResponseBody response) >>> getChildren >>> ...

Depending on the version of the API, I found that the response body includes a byte order mark before the XML:

error: ""\65279<?xml version=\"1.0\" encoding=\"utf-8\"?><Enume..."" (line 1, column 1):
unexpected "\65279"
expecting xml declaration, comment, processing instruction, "<!DOCTYPE" or "<"

Since the BOM may or may not be there, I did the following:

...
let resBody = Data.ByteString.UTF8.toString . toStrict $ getResponseBody response
    parseBody body = runX $ readString [withValidate no] body >>> getChildren >>> ...
xs <- parseBody resBody
val <- case xs of
  x : _ -> pure x
  _ -> head <$> (parseBody $ drop 1 resBody)
...

This works, but the parse error is still printed whenever the BOM is present. What are the options for parsing XML that may or may not start with a BOM, without printing error messages?


Solution

  • Okay, given that you're already assuming the encoding is UTF-8, as you do here, probably the simplest option is to pattern match on the decoded string and discard a leading BOM:

    case toString ... of
        '\65279':s -> s
        s -> s
    

    As an aside, having just looked through the XML spec to see how encodings are supposed to be handled, let me just say: eugh, gross. There appears to be no encoding-agnostic way to specify what encoding to use, so the only really correct, robust thing to do is try a bunch during parsing and hope one succeeds.
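
For reference, the pattern match above can be written as a small standalone function. This is only a sketch: the name `stripBom` is made up for illustration, and it assumes the body has already been decoded to a `String` as UTF-8, so the BOM shows up as the single character U+FEFF (the `\65279` in the error message).

```haskell
-- Hypothetical helper (the name is not from HXT or http-conduit).
-- Drops one leading U+FEFF (decimal 65279) from an already-decoded String
-- and returns any other input unchanged.
stripBom :: String -> String
stripBom ('\xFEFF' : rest) = rest
stripBom s                 = s
```

In the question's code this would be applied once to the decoded body, for example `readString [withValidate no] (stripBom resBody)`. The parser then never sees the BOM, so there is no failed first attempt and nothing gets printed.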