haskellmegaparsec

Megaparsec parse many until given marker with backtracking


I'm trying to parse the following input using megaparsec (if this sounds familiar to the advent of code challenge from day 7 2022, it's because it is):

$ cd /
$ ls
dir a
123 foo.txt
dir d
$ cd a
$ ls
dir e
4567 f
78901 h.zip
$ cd e

I want to parse it into the following data structure:

data DirContent = File (Integer, String) | Directory String deriving stock (Show)


data Input
  = Cd DirectoryName
  | CdRoot -- cd /
  | CdPop -- cd ..
  | Ls [DirContent]
  deriving stock (Show)

I came up with the following parser:

import Text.Megaparsec
import Text.Megaparsec.Char

parseFull :: Parsec Void Text [Input]
parseFull = inputParser `sepBy` newline

inputParser :: Parsec Void Text Input
inputParser =
  CdRoot <$ string "$ cd /"
    <|> CdPop <$ string "$ cd .."
    <|> Cd <$> (string "$ cd " *> many alphaNumChar)
    <|> Ls <$> ((string "$ ls" <* newline) *> dirContentParser `sepBy` newline)

dirContentParser :: Parsec Void Text DirContent
dirContentParser =
  Directory <$> (string "dir " *> (pure <$> alphaNumChar))
    <|> curry File <$> decimal <* char ' ' <*> many (alphaNumChar <|> char '.')

The idea is that all the lines after a $ ls are the results of this command and should be contained in the parsed value. Therefore, the parser has to consume all lines after a $ ls until the next line starting with $ (but without consuming it, because otherwise the following parser will fail).

The parser partly works, but only if there is at most one ls command in it and only if it's the last command in the input. When I run the parser with debug mode, I get the following error:

input> IN: "$ cd /<newline>$ ls<newline>dir d<newli <…>
input> MATCH (CERR): "$ cd /<newline>$ ls<newline>dir d<newli <…>
input> ERROR:
input> offset=33:
input> unexpected "$ cd"
input> expecting "dir " or integer

as far as I can tell, the ls parser doesn't end at the next command. What I would need is a parser that stops when the next $ appears, but doesn't consume it and leave it to the next parser. I tried using try, but it didn't change anything. manyTill_ sounds great, but it consumes the token at the end.


Solution

  • Based on the suggestion in the comments, I came up with the following solution:

    parseFull :: Parsec Void Text [Input]
    parseFull = many inputParser
    
    inputParser :: Parsec Void Text Input
    inputParser =
      CdRoot <$ string "$ cd /" <* splitter
        <|> CdPop <$ string "$ cd .." <* splitter
        <|> Cd <$> (string "$ cd " *> many alphaNumChar <* splitter)
        <|> Ls <$> (string "$ ls" >> newline *> many (dirContentParser <* splitter))
      where
        splitter = optional newline
    
    dirContentParser :: Parsec Void Text DirContent
    dirContentParser =
      Directory <$> (string "dir " *> (pure <$> alphaNumChar))
        <|> curry File <$> decimal <* char ' ' <*> many (alphaNumChar <|> char '.')
    

    It doesn't use sepBy, but instead parses an optional newline at the end of each line (newline optional because otherwise the parser fails if there's no newline at the end of the input)