parsing haskell functional-programming parsec

Using Parsec to parse a string containing a list of tuples

I am trying to parse (using parsec) a string that represents some data type that I defined. Thus the string needs to be parsed to my data type. An example of the string would be,

[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]

This would parse to the following,

[(Reg 1, [Cmp (Jlt, Intv (0, 0)), Op (Mod, Intv (-4,4))]), (Reg 2, [Cmp (Jge, (4,4))])]

Now this makes use of a few custom data types,

newtype Reg = Reg Int deriving (Eq, Show, Ord)

data LF = Op (BinAlu, Interval) | Cmp (Jcmp, Interval) | Invalid
    deriving (Eq, Show, Ord)

data BinAlu
  = Add
  | Sub
  | Mul
  | Div
  | Or
  | And
  | Lsh
  | Rsh
  | Mod
  | Xor
  | Mov
  | Arsh
  deriving (Eq, Show, Ord, Enum)

data Jcmp = Jeq | Jgt | Jge | Jlt | Jle | Jset | Jne | Jsgt | Jsge | Jslt | Jsle
  deriving (Eq, Show, Ord, Enum)

data Interval = Bot | Intv (Int, Int)
  deriving (Eq, Show, Ord)

Thus I want to parse the string to be the following type [(Reg, [LF])]

Now I am quite lost on how to actually do this. I think I have an idea, but I find it difficult implementing that idea.

My idea is to first use, between (symbol "[") (symbol "]"), which hopyfully would give me the contents between [ and ]. Then I need to do something similar for the paranthesis but repeat it. And then of course parsing the contents within the paranthesis.

I am basically looking for any advice on how to setup this parser. And how in general to structure such a parser.

Any help is greatly appreciated!

Solution

The following should get you started. We'll need some imports:

module TupleParser where
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.String

In order to properly process whitespace, you should begin by writing some combinators to handle "lexemes", parsers that expect to start at a non-whitespace character, parse something, and discard trailing whitespace. While Parsec has some lexeme support in Text.Parsec.Token, it's over-designed and hard to use. Here's a simplified alternative, based on the Megaparsec approach:

-- a lexeme starts on non-whitespace, parses something,
-- and discards trailing whitespace
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- a symbol is a verbatim string, treated as a lexeme
symbol :: String -> Parser String
symbol s = lexeme (string s)

The following are pretty standard lexemes for parsing numbers:

-- an unsigned decimal number, treated as a lexeme
decimal :: (Read n, Integral n) => Parser n
decimal = lexeme (read <$> many1 digit)

-- combinator for signed numbers; replace "string" with
-- "symbol" if you want to allow space between dash and
-- first digit
signed :: (Num n) => Parser n -> Parser n
signed p = option id (negate <$ string "-") <*> p

And here are some other pretty standard lexemes/combinators:

-- some standard names
comma :: Parser String
comma = symbol ","

parens :: Parser p -> Parser p
parens = between (symbol "(") (symbol ")")

brackets :: Parser p -> Parser p
brackets = between (symbol "[") (symbol "]")

Here's a helper for lists, since you use it in a couple places.

-- a list is a bracket-delimited, comma-separated list
listOf :: Parser p -> Parser [p]
listOf p = brackets (p `sepBy` comma)

Now, we should define the lowest-level "atoms" of your grammar:

-- (insert your data types here)

reg :: Parser Reg
reg = Reg <$> decimal

lf :: Parser LF
lf = parens
  $   Op <$> ((,) <$> binalu <* comma <*> interval)
  <|> Cmp <$> ((,) <$> jcmp <* comma <*> interval)
  <|> Invalid <$ symbol "???"

-- I don't really understand your interval syntax, so
-- I'm just parsing any number "n" into "Intv (n,n)"
interval :: Parser Interval
interval = (\x -> Intv (x,x)) <$> signed decimal

For binalu and jcmp, a simple first attempt might look like this:

binalu :: Parser BinAlu
binalu
  =   Mod <$ symbol "%"
  -- etc.

jcmp :: Parser Jcmp
jcmp
  =   Jlt <$ symbol "<"
  <|> Jge <$ symbol ">="
  -- etc.

and this is sufficient to parse your example input. However, there's a problem here when you flesh these out with all your desired operators. The parser symbol "<" for example, will happily parse the first character of "<=", leaving the "=" to cause an error when you try to parse the comma next. If you order the alternatives to try "<=" first:

jcmp :: Parser Jcmp
jcmp
  =   Jle <$ symbol "<="
  <|> Jlt <$ symbol "<"
  -- etc.

this still isn't sufficient, because symbol "<=" will happily start parsing a "<" not followed by a "=" and then "fail after consuming input", which prevents any later alternatives from being tried. You can use the try combinator to continue anyway:

jcmp :: Parser Jcmp
jcmp
  =   try (Jle <$ symbol "<=")
  <|> Jlt <$ symbol "<"
  -- etc.

but this is tedious to get right. The usual solution is to define a list of "operator characters":

-- include every character the appears in one of your operators
opChars :: String
opChars = "+-*/|&<=>%^!"

and define an operator combinator (note: Parsec calls this combinator reservedOp) that parses an operator followed by something other than an operator character:

operator :: String -> Parser String
operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))

Now, you can list operators in any order, and they'll work fine:

jcmp :: Parser Jcmp
jcmp
  =   Jle <$ operator "<="
  <|> Jlt <$ operator "<"
  <|> Jgt <$ operator ">"
  <|> Jge <$ operator ">="
  -- etc.

Finally, we can define the grammar for your higher-level structures. Note that the topmost parser should ignore leading whitespace, as all the lexeme parsers expect to start with non-whitespace, and check for end-of-input.

type Program = [Statement]
type Statement = (Reg, [LF])

program :: Parser Program
program = spaces *> listOf statement <* eof

statement :: Parser Statement
statement = parens $ (,) <$> reg <* comma <*> listOf lf

Here's a test on your proposed input:

main = parseTest program "[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]"

which should produce the output:

[(Reg 1,[Cmp (Jlt,Intv (0,0)),Op (Mod,Intv (4,4))]),(Reg 2,[Cmp (Jge,Intv (4,4))])]

Full code:

module TupleParser where

import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.String

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

 -- characters appearing in operators
opChars :: String
opChars = "+-*/|&<=>%^!"

-- parse an operator
operator :: String -> Parser String
operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))

decimal :: (Read n, Integral n) => Parser n
decimal = lexeme (read <$> many1 digit)

signed :: (Num n) => Parser n -> Parser n
signed p = option id (negate <$ string "-") <*> p

comma :: Parser String
comma = symbol ","

parens :: Parser p -> Parser p
parens = between (symbol "(") (symbol ")")

brackets :: Parser p -> Parser p
brackets = between (symbol "[") (symbol "]")

listOf :: Parser p -> Parser [p]
listOf p = brackets (p `sepBy` comma)

newtype Reg = Reg Int deriving (Eq, Show, Ord)

data LF = Op (BinAlu, Interval) | Cmp (Jcmp, Interval) | Invalid
    deriving (Eq, Show, Ord)

data BinAlu
  = Add
  | Sub
  | Mul
  | Div
  | Or
  | And
  | Lsh
  | Rsh
  | Mod
  | Xor
  | Mov
  | Arsh
  deriving (Eq, Show, Ord, Enum)

data Jcmp = Jeq | Jgt | Jge | Jlt | Jle | Jset | Jne | Jsgt | Jsge | Jslt | Jsle
  deriving (Eq, Show, Ord, Enum)

data Interval = Bot | Intv (Int, Int)
  deriving (Eq, Show, Ord)

reg :: Parser Reg
reg = Reg <$> decimal

lf :: Parser LF
lf = parens
  $   Op <$> ((,) <$> binalu <* comma <*> interval)
  <|> Cmp <$> ((,) <$> jcmp <* comma <*> interval)
  <|> Invalid <$ symbol "???"

binalu :: Parser BinAlu
binalu
  =   Mod <$ operator "%"
  -- etc.

jcmp :: Parser Jcmp
jcmp
  =   Jlt <$ operator "<"
  <|> Jge <$ operator ">="
  -- etc.

-- I don't really understand your interval syntax, so
-- I'm just parsing any number "n" into "Intv (n,n)"
interval :: Parser Interval
interval = (\x -> Intv (x,x)) <$> signed decimal

type Program = [Statement]
type Statement = (Reg, [LF])

program :: Parser Program
program = spaces *> listOf statement <* eof

statement :: Parser Statement
statement = parens $ (,) <$> reg <* comma <*> listOf lf

main = parseTest program "[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]"