Given parser combinators as defined by libraries such as Parsec, Attoparsec or various other functional implementations, is it possible to parse languages such as C or Haskell themselves?
Here is an example of what I have in mind:
-- constructor defined by its name, and a list of arguments
data Constructor = Constructor String [Type]
-- type definition contains a type name, list of type variables, and a list of constructors
data Type = Type String [Char] [Constructor]
In this very simplified example, parsing of a type could be:
typeParser :: Parser Type
typeParser = do
string "data"
spaces
name <- takeWhile letters
spaces
typeVars <- many1 letter
...
I noticed that the package http://hackage.haskell.org/package/haskell-src-1.0.3.1 parses the Haskell 98 language, but it does not depend on any of the parser combinator libraries.
Some programming languages like Haskell are not fully context-free. This means that some contextual information is needed in order to parse them. Haskell is not fully context-free because it is indentation-sensitive.
Some monadic parser combinator libraries like Parsec and Megaparsec allow for more easily parsing context-sensitive languages. Parsec'sParsecT
and Parsec
types can keep track of contextual information, which the library refers to as "user state", which allows for parsing the context-sensitive parts of languages like indentation level. The "user state" can be accessed through the getState
, putState
, and modifyState
functions. The tricky part is mixing parsers that have "user states" of different types.
It is possible to use approaches other than monadic parser combinators, however they are often more limited and/or less straightforward and can require more workarounds to get them working. For example, a parser generator library like Flex/Bison could be used to parse the context-free parts of Haskell, however a workaround would be required to parse the context-sensitive parts because parser generator libraries can only parse context-free languages.