haskellparsecparser-combinators

Parse Haskell itself with parser combinators


Given parser combinators as defined by libraries such as Parsec, Attoparsec or various other functional implementations, is it possible to parse languages such as C or Haskell themselves?

Here is an example of what I have in mind:

-- constructor defined by its name, and a list of arguments           
data Constructor = Constructor String [Type]

-- type definition contains a type name, list of type variables, and a list of constructors
data Type = Type String [Char] [Constructor] 

In this very simplified example, parsing of a type could be:

typeParser :: Parser Type
typeParser = do
  string "data"
  spaces
  name <- takeWhile letters
  spaces
  typeVars <- many1 letter
  ...

I noticed that the package http://hackage.haskell.org/package/haskell-src-1.0.3.1 parses the Haskell 98 language, but it does not depend on any of the parser combinator libraries.


Solution

  • TL;DR Yes, you can parse Haskell using a monadic parser combinator library like Parsec.

    Some programming languages like Haskell are not fully context-free. This means that some contextual information is needed in order to parse them. Haskell is not fully context-free because it is indentation-sensitive.

    Some monadic parser combinator libraries like Parsec and Megaparsec allow for more easily parsing context-sensitive languages. Parsec'sParsecT and Parsec types can keep track of contextual information, which the library refers to as "user state", which allows for parsing the context-sensitive parts of languages like indentation level. The "user state" can be accessed through the getState, putState, and modifyState functions. The tricky part is mixing parsers that have "user states" of different types.

    It is possible to use approaches other than monadic parser combinators, however they are often more limited and/or less straightforward and can require more workarounds to get them working. For example, a parser generator library like Flex/Bison could be used to parse the context-free parts of Haskell, however a workaround would be required to parse the context-sensitive parts because parser generator libraries can only parse context-free languages.