parsinghaskellattoparsec

Convert Attoparsec parser to parse from another string type


Is there some "easy" way (e.g. something I am missing in Attoparsec or some other library) to convert a defined Attoparsec parser that parses from ByteString to the one that parses from Text?

For example I have:

import Data.Attoparsec.ByteString.Char8
myTypeByteStringParser :: Parser MyType

What's the way to transform it into:

import Data.Attoparsec.Text
myTypeTextParser :: Parser MyType

It does look like contramap (from hoogling type signature) but it is probably not possible to define Contravariant for Parser?


Solution

  • This is possible in general and you don't need to fork attoparsec. Inconsiderately attoparsec doesn't expose enough of its internals, but don't let that stop us:

    {-# LANGUAGE LambdaCase #-}
    {-# LANGUAGE QuasiQuotes #-}
    
    module Parsers where
    
    import qualified Data.Attoparsec.ByteString as AB
    import qualified Data.Attoparsec.Internal.Types as AIT
    import qualified Data.Attoparsec.Text as AT
    import Data.ByteString (ByteString)
    import qualified Data.ByteString.Internal as BI
    import Data.Text (Text)
    import Data.Text.Encoding (decodeUtf8, encodeUtf8)
    import qualified Data.Text.Internal as TI
    import Unsafe.TrueName
    
    bsToTextState :: AIT.State ByteString -> AIT.State Text
    bsToTextState = bufferText . decodeUtf8 . unbufferBS where
        unbufferBS :: AIT.State ByteString -> ByteString
        unbufferBS [truename| ''AIT.State
            Data.Attoparsec.ByteString.Buffer.Buffer
            Buf | fp off len _ _ |] = BI.PS fp off len
        bufferText :: Text -> AIT.State Text
        bufferText (TI.Text arr off len) = [truename| ''AIT.State
            Data.Attoparsec.Text.Buffer.Buffer
            Buf |] arr off len len 0
    
    textToBSState :: AIT.State Text -> AIT.State ByteString
    textToBSState = bufferBS . encodeUtf8 . unbufferText where
        unbufferText :: AIT.State Text -> Text
        unbufferText [truename| ''AIT.State
            Data.Attoparsec.Text.Buffer.Buffer
            Buf | arr off len _ _ |] = TI.Text arr off len
        bufferBS :: ByteString -> AIT.State ByteString
        bufferBS (BI.PS fp off len) = [truename| ''AIT.State
            Data.Attoparsec.ByteString.Buffer.Buffer
            Buf |] fp off len len 0
    
    mapIResult :: (i -> j) -> (j -> i) -> AIT.IResult i a -> AIT.IResult j a
    mapIResult f g = go where
        go = \case
            AIT.Fail i ctx msg -> AIT.Fail (f i) ctx msg
            AIT.Partial k -> AIT.Partial (go . k . g)
            AIT.Done i r -> AIT.Done (f i) r
    
    mapFailure :: (i -> j) -> (j -> i) -> (AIT.State j -> AIT.State i) ->
        AIT.Failure i (AIT.State i) r -> AIT.Failure j (AIT.State j) r
    mapFailure f g h k st p m ctx msg = mapIResult f g $ k (h st) p m ctx msg
    
    mapSuccess :: (i -> j) -> (j -> i) -> (AIT.State j -> AIT.State i) ->
        AIT.Success i (AIT.State i) a r -> AIT.Success j (AIT.State j) a r
    mapSuccess f g h k st p m a = mapIResult f g $ k (h st) p m a
    
    bsToTextParser :: AB.Parser a -> AT.Parser a
    bsToTextParser (AIT.Parser bsP) = AIT.Parser textP where
        textP st p m f s = mapIResult decodeUtf8 encodeUtf8 $ bsP
            (textToBSState st) p m
            (mapFailure encodeUtf8 decodeUtf8 bsToTextState f)
            (mapSuccess encodeUtf8 decodeUtf8 bsToTextState s)
    
    textToBSParser :: AT.Parser a -> AB.Parser a
    textToBSParser (AIT.Parser textP) = AIT.Parser bsP where
        bsP st p m f s = mapIResult encodeUtf8 decodeUtf8 $ textP
            (bsToTextState st) p m
            (mapFailure decodeUtf8 encodeUtf8 textToBSState f)
            (mapSuccess decodeUtf8 encodeUtf8 textToBSState s)
    

    {,un}buffer{BS,Text} are adapted from the respective internal modules Data.Attoparsec.{ByteString,Text}.Buffer.

    Was a good excuse for me to update true-name to work with more recent GHC though. Depending on how up-to-date you are, you may need the WIP from GitHub.

    It's probably not terrible for performance, as long as you keep in mind that each time you use textToBSParser, the entire input gets fed through encodeUtf8 with any leftover converted back via decodeUtf8, and vice versa for bsToTextParser. If you only convert a Parser once at the top-level, it shouldn't be too different from simply converting the input as the other answer suggests.

    PS: I haven't tested this beyond

    $ ghci -XOverloadedStrings parsers.hs 
    *Parsers> textToBSParser AT.scientific `AB.parseTest` "123 "
    Done " " 123.0
    

    PPS: for your own parsers, you might be able to leverage OverloadedStrings and write p :: IsString s => AIT.Parser s a instead, with {-# SPECIALISE p :: AT.Parser a #-} pragmas. I've not explored how workable this idea is.