haskell attoparsec

Matching values that carry onto multiple following lines with attoparsec


I'm trying to parse the following:

message: 123 test
abc xys
messageA: hmm
messageA: testing
messageB: aueo
qkhwueoaz

Into something like:

[ ("message", "123 test\nabc xys")
, ("messageA", "hmm")
, ("messageA", "testing")
, ("messageB", "aueo\nqkhwueoaz")
]

However, I can't figure this out. Part of the difficulty is that I'm not very familiar with attoparsec's functionality, and I can't see it documented for each function whether it moves the cursor forward.

I've read through "Multi-line *non* match with attoparsec" and have come up with the following code:

isChrisNext :: Parser ()
isChrisNext = lookAhead (parseChris) *> pure()

notFollowedBy :: Monad m => m a -> m b
notFollowedBy p = p >> fail "not followed by"

restOfLine :: Parser Text
restOfLine = do
    rest <- takeTill (== '\n')
    isEOF <- atEnd
    if isEOF then
        return rest
    else
        (char '\n') >> return rest

parseChris :: Parser [Text]
parseChris = do
  x <- takeWhile1 (notInClass ":")
  _ <- string ":"
  x' <- manyTill restOfLine (endOfInput <|> isChrisNext)
  () <- return $ unsafePerformIO $! do
    print "?????????????"
    print x
    print x'
  return $ x : x'

However, trying to parse the data with parseChris just returns [ "message" ], while I'm expecting ("message", "123 test\nabc xys").

If I change the lookahead function to:

isChrisNext :: Parser ()
isChrisNext = lookAhead (string "message:") *> pure()

I get output that is closer to what I intended:

[ "message"
, "123 test"
, "abc xys"
]

In addition, the question mentioned earlier also has a comment suggesting an approach of:

Just parse the log times apart by matching on time stamps, and only within each time-entry parse the sub-entries.

I'm also aware of a potential issue where a continuation line could itself contain a :, but thankfully that is not something I need to handle.


Solution

  • The approach I find really useful when working with parser combinators is to break down the whole problem into smaller pieces. So, I'd just compose the parser bottom-up: a keyValuePair first, and then the whole parser consisting just of many keyValuePairs. The keyValuePair consumes the part before : and then just eats as many lines without : as it can.

    In code:

    {-# LANGUAGE OverloadedStrings #-}
    
    import qualified Data.ByteString.Char8 as BS
    import qualified Data.Attoparsec.ByteString.Char8 as AT
    import Control.Applicative
    import Data.Functor
    
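    -- One line of a value: everything up to ':' or '\n', followed by a newline.
    -- Fails if the line contains a ':' (i.e. it is the start of the next key).
    valuePart :: AT.Parser BS.ByteString
    valuePart = AT.takeTill (`BS.elem` ":\n") <* AT.endOfLine
    
    -- A key, the ": " separator, then one or more value lines joined by '\n'.
    keyValuePair :: AT.Parser (BS.ByteString, BS.ByteString)
    keyValuePair = do
        key <- AT.takeTill (== ':')
        void ": "
        valLines <- AT.many1 valuePart
        pure (key, BS.intercalate "\n" valLines)
    
    -- The whole input: zero or more key/value pairs.
    parser :: AT.Parser [(BS.ByteString, BS.ByteString)]
    parser = many keyValuePair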
    

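    Running it on your input data produces the following (the final line qkhwueoaz is missing, apparently because the input does not end with a newline, which valuePart requires):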

    *Main> AT.parseOnly parser test
    Right [("message","123 test\nabc xys"),("messageA","hmm"),("messageA","testing"),("messageB","aueo")]
    

    Note that there is no lookahead, because none is needed: as soon as valuePart encounters a :, it fails (attoparsec backtracks on failure, so nothing is consumed), which ends the many1 loop in keyValuePair and lets the top-level many in parser run the next keyValuePair.
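If the input may end without a trailing newline, one possible variant is to let a value line be terminated by either a newline or the end of the input. The following is only a sketch under that assumption. The primed names (valuePart', keyValuePair', parser') are hypothetical and not part of the code above. It uses takeWhile1 rather than takeTill so that a value line must contain at least one character, which keeps many1 from looping on an empty match at the end of the input.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as AT
import Control.Applicative

-- Hypothetical variant of valuePart: a non-empty run of characters other
-- than ':' and '\n', terminated by a newline or by the end of the input.
valuePart' :: AT.Parser BS.ByteString
valuePart' =
    AT.takeWhile1 (\c -> c /= ':' && c /= '\n')
        <* (AT.endOfLine <|> AT.endOfInput)

-- Same shape as keyValuePair, but built on valuePart'.
keyValuePair' :: AT.Parser (BS.ByteString, BS.ByteString)
keyValuePair' = do
    key <- AT.takeWhile1 (/= ':')
    _ <- AT.string ": "
    valLines <- AT.many1 valuePart'
    pure (key, BS.intercalate "\n" valLines)

-- Requiring endOfInput makes leftover, unparsed input a parse error
-- instead of silently discarding it.
parser' :: AT.Parser [(BS.ByteString, BS.ByteString)]
parser' = many keyValuePair' <* AT.endOfInput
```

With this variant, the trailing qkhwueoaz line should end up in the messageB value. Blank lines inside a value are not accepted by this sketch, since every value line must be non-empty.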

    BTW you can use trace and traceShow from Debug.Trace instead of unsafePerformIO to produce debugging output.
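As a rough illustration of the Debug.Trace suggestion, here is a minimal sketch of logging an intermediate result from inside a parser. The name tracedKey is hypothetical and only serves this example. traceM is the monadic counterpart of trace from the same module, and it writes to stderr.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as AT
import Debug.Trace (traceM)

-- Hypothetical helper: parse a key up to the ':' and log it for debugging.
tracedKey :: AT.Parser BS.ByteString
tracedKey = do
    key <- AT.takeTill (== ':')
    traceM ("key = " ++ show key)  -- debugging output, no unsafePerformIO needed
    pure key
```

Because attoparsec backtracks, a trace placed inside an alternative that is tried and then abandoned may still print, so the messages show what was attempted rather than only what ended up in the result.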