I'm trying to parse the following:
message: 123 test
abc xys
messageA: hmm
messageA: testing
messageB: aueo
Into something like:
[ ("message", "123 test\nabcxyz")
, ("messageA", "hmm")
, ("messageA", "testing")
, ("messageB", "aueo\nqkhwueoaz")
]
However, I just can't seem to figure this out. Part of the difficulty is that I'm not 100% familiar with attoparsec's functionality, and I can't really tell from the documentation whether each function moves the cursor forward.
I've read through "Multi-line *non* match with attoparsec", and I've got the following code:
isChrisNext :: Parser ()
isChrisNext = lookAhead (parseChris) *> pure()

notFollowedBy :: Monad m => m a -> m b
notFollowedBy p = p >> fail "not followed by"

restOfLine :: Parser Text
restOfLine = do
    rest <- takeTill (== '\n')
    isEOF <- atEnd
    if isEOF then
        return rest
    else
        (char '\n') >> return rest

parseChris :: Parser [Text]
parseChris = do
    x <- takeWhile1 (notInClass ":")
    _ <- string ":"
    x' <- manyTill restOfLine (endOfInput <|> isChrisNext)
    () <- return $ unsafePerformIO $! do
        print "?????????????"
        print x
        print x'
    return $ x : x'
However, trying to parse the data with parseChris just returns:
[ "message" ]
while I'm expecting ("message", "123 test\nabcxyz")
If I change the lookahead function to:
isChrisNext :: Parser ()
isChrisNext = lookAhead (string "message:") *> pure()
I get output that is closer to what I intended:
[ "message"
, "123 test"
, "abc xys"
In addition, the question mentioned earlier also has a comment suggesting an approach of:
Just parse the log times apart by matching on time stamps, and only within each time-entry parse the sub-entries.
I'm also aware of a potential issue where a second line could contain a :, but thankfully this is not something I need to take into consideration.
The approach I find really useful when working with parser combinators is to break the whole problem down into smaller pieces. So I'd compose the parser bottom-up: a keyValuePair first, and then the whole parser, which consists of just many keyValuePairs. The keyValuePair consumes the part before the :, and then eats as many lines without a : as it can.
In code:
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as AT
import Control.Applicative
import Data.Functor

-- One line of a value: everything up to a ':' or a newline, which must
-- then be followed by a line ending. A line containing ':' fails here.
valuePart :: AT.Parser BS.ByteString
valuePart = AT.takeTill (`BS.elem` ":\n") <* AT.endOfLine

-- A key, the ": " separator, then one or more value lines.
keyValuePair :: AT.Parser (BS.ByteString, BS.ByteString)
keyValuePair = do
    key <- AT.takeTill (== ':')
    void ": "
    valLines <- AT.many1 valuePart
    pure (key, BS.intercalate "\n" valLines)

parser :: AT.Parser [(BS.ByteString, BS.ByteString)]
parser = many keyValuePair
Running on your input data produces
*Main> AT.parseOnly parser test
Right [("message","123 test\nabc xys"),("messageA","hmm"),("messageA","testing"),("messageB","aueo")]
Note there is no lookahead, as there is no need for it: as soon as valuePart encounters a :, it just fails, which causes keyValuePair to stop and the next keyValuePair to get run by the top-level many in parser.
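One thing to watch: valuePart above insists on a line ending after every value line, so if the input has no newline after its last line, the final pair would be dropped. Below is a minimal self-contained sketch of a variant that also accepts end of input. The names sampleInput, valuePart', keyValuePair' and parser' are made up for this illustration, and sampleInput is assumed to be the question's text without a trailing newline. Unlike the original valuePart, this variant uses takeWhile1 and so does not accept empty lines inside a value.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as AT
import Control.Applicative (many, (<|>))
import Data.Functor (void)

-- Assumed input: the question's text, with no newline after the last line.
sampleInput :: BS.ByteString
sampleInput = BS.intercalate "\n"
  [ "message: 123 test"
  , "abc xys"
  , "messageA: hmm"
  , "messageA: testing"
  , "messageB: aueo"
  ]

-- One non-empty value line containing no ':', ended either by a newline
-- or by the end of the input. takeWhile1 must consume at least one
-- character, so this parser cannot succeed on empty input and make
-- many1 loop forever.
valuePart' :: AT.Parser BS.ByteString
valuePart' =
  AT.takeWhile1 (\c -> c /= ':' && c /= '\n')
    <* (AT.endOfLine <|> AT.endOfInput)

-- Same shape as keyValuePair, but built on valuePart'.
keyValuePair' :: AT.Parser (BS.ByteString, BS.ByteString)
keyValuePair' = do
  key <- AT.takeTill (== ':')
  void (AT.string ": ")
  valLines <- AT.many1 valuePart'
  pure (key, BS.intercalate "\n" valLines)

parser' :: AT.Parser [(BS.ByteString, BS.ByteString)]
parser' = many keyValuePair'
```

Loaded into GHCi, this should give:

```
*Main> AT.parseOnly parser' sampleInput
Right [("message","123 test\nabc xys"),("messageA","hmm"),("messageA","testing"),("messageB","aueo")]
```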
BTW you can use trace and traceShow from Debug.Trace instead of unsafePerformIO to produce debugging output.
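As a rough sketch of what that could look like, here is a small wrapper that logs a parser's result whenever it succeeds. The helper traced and the example tracedKey are hypothetical names for this illustration; traceM is another function from Debug.Trace that is convenient inside do blocks.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as AT
import Debug.Trace (traceM)

-- Hypothetical helper: run a parser and, if it succeeds, write its
-- result to stderr, prefixed with a label.
traced :: Show a => String -> AT.Parser a -> AT.Parser a
traced label p = do
  x <- p
  traceM (label ++ " -> " ++ show x)
  pure x

-- The key part of a "key: value" line, with tracing attached.
tracedKey :: AT.Parser BS.ByteString
tracedKey = traced "key" (AT.takeTill (== ':'))
```

Running AT.parseOnly tracedKey "message: 123 test\n" should return Right "message" and emit the trace line on stderr. Nothing is logged when the wrapped parser fails, so the last trace line shows how far the parse got.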