Implementing "includes" when parsing in Attoparsec

I am writing a DSL for fun. I decided to use attoparsec because I was familiar with it.

I want to implement parsing of includes with file paths like this:

include /some/dir/file.ext

or URLs:

include http://blah.com/my/file.ext

So when I'm parsing I expect to read the referenced resource and parse the entire thing, appending its contents to the "outer" parsing state.

The problem is that although the parsing of these statements is easy, I can't run IO (as I understand it) within my Attoparsec parsers.

How do I use Attoparsec to achieve this? Do I chop the initial input up using some string filtering and then pass each "block" to parse and feed accordingly? Essentially a two-pass parse approach?

Solution

Attoparsec parsers are pure (Data.Attoparsec.Internal.Types.Parser is not a monad transformer and has no way to embed IO), so you’re right that you can’t expand includes from within a parser directly.

Splitting the work into two passes seems like the right approach. The first pass acts like the C preprocessor: it accepts a file with include statements interleaved with other stuff. The “other stuff” only needs to be lexically valid; it does not have to satisfy your full parser, just as the C preprocessor only cares about tokens and matching parentheses, not about other brackets or anything semantic. You then replace the includes, producing a fully expanded file that you can give to your existing parser.
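A minimal sketch of such a preprocessor-style pass, using only base. It assumes each include sits on its own line, starts in the first column, and is written as "include " followed by the target. The load argument is a hypothetical hook (not part of attoparsec): readFile for paths, or some HTTP client of your choice for URLs. The names expandIncludes and load are made up for illustration, and resolving paths relative to the including file is left out.

```haskell
import Data.List (stripPrefix)

-- | Replace every line of the form "include <target>" with the
-- (recursively expanded) text that 'load' returns for <target>.
-- All other lines are passed through untouched.
expandIncludes :: Monad m => (String -> m String) -> String -> m String
expandIncludes load = go []
  where
    -- 'active' lists the targets currently being expanded,
    -- so that a file including itself is reported instead of looping.
    go active src = unlines . concat <$> mapM (expandLine active) (lines src)

    expandLine active line =
      case stripPrefix "include " line of
        Just target
          | target `elem` active -> error ("include cycle at " ++ target)
          | otherwise -> lines <$> (load target >>= go (target : active))
        Nothing -> pure [line]
```

In IO this would be called as expandIncludes readFile source, and the resulting string handed to the existing attoparsec parser with parseOnly. Keeping the loader as a parameter also makes the pass easy to test against an in-memory table of files.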

If an included file must be syntactically “standalone” in some sense^†, then you can parse each file in full first, producing items interleaved with include markers, and then replace each marker with the parsed contents of the file it names. For instance:

-- Whatever items you’re parsing.
data Item

-- A reference to an included path.
data Include = Include FilePath

parse :: Parser [Either Include Item]

-- Substitute includes; also calls ‘parse’
-- recursively until no includes remain.
substituteIncludes :: [Either Include Item] -> IO [Item]

^† Say, if you’re just using attoparsec for lexing tokens that can’t cross file boundaries anyway, or you’re doing full parsing but want to disallow an include file that contains e.g. unmatched brackets.
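The signatures above could be fleshed out roughly as follows with Data.Attoparsec.Text. This is only a sketch under some assumptions: Item is a placeholder that holds one non-blank line of text, the parser is named document rather than parse to avoid clashing with attoparsec’s own parse, and substituteIncludes takes an extra load argument (a hypothetical hook for fetching a path or URL) that the original signature does not have. There is no cycle detection here, so a file that includes itself would loop.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (many, (<|>))
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Placeholder for whatever the DSL really parses: one line of text.
newtype Item = Item Text deriving (Eq, Show)

-- A reference to an included path.
newtype Include = Include FilePath deriving (Eq, Show)

-- Everything up to (not including) the end of the current line.
restOfLine :: A.Parser Text
restOfLine = A.takeWhile1 (not . A.isEndOfLine)

-- "include <path>"; pure, it only records the path.
includeP :: A.Parser Include
includeP = do
  _ <- A.string "include"
  _ <- A.takeWhile1 A.isHorizontalSpace
  Include . T.unpack <$> restOfLine

-- A whole file: items interleaved with include markers.
document :: A.Parser [Either Include Item]
document = A.skipSpace *> many (entry <* A.skipSpace) <* A.endOfInput
  where
    entry = (Left <$> includeP) <|> (Right . Item <$> restOfLine)

-- Parse one source text, then expand whatever it includes.
parseExpanded :: (FilePath -> IO Text) -> Text -> IO [Item]
parseExpanded load src =
  case A.parseOnly document src of
    Left err    -> fail ("parse error: " ++ err)
    Right parts -> substituteIncludes load parts

-- Replace each include marker with the parsed items of its target.
substituteIncludes :: (FilePath -> IO Text) -> [Either Include Item] -> IO [Item]
substituteIncludes load parts = concat <$> mapM expand parts
  where
    expand (Right item)          = pure [item]
    expand (Left (Include path)) = load path >>= parseExpanded load

-- Entry point for files on disk.
parseFile :: FilePath -> IO [Item]
parseFile path = TIO.readFile path >>= parseExpanded TIO.readFile
```

Because every file goes through document on its own, an included file with, say, unmatched brackets is rejected at that file rather than being spliced into its parent first.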

The other option is to embed IO in your parser directly by using a different parsing library such as megaparsec, which provides a ParsecT transformer that you can wrap around IO to do IO directly in your parser. I would probably do this for a prototype, but it seems tidier to separate the concerns of parsing and expansion as much as possible.
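For comparison, a rough sketch of the megaparsec variant, assuming megaparsec 9 or later (for hspace1) and the same placeholder Item type of one line of text. The names Parser, restOfLine, includeP, itemP and document are made up for illustration. The include parser reads the file with liftIO and parses it with a nested runParserT call, so only local paths are handled; a URL would need an HTTP client in place of readFile.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.IO.Class (liftIO)
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (hspace1, space, string)

-- A parser that can run IO actions via liftIO.
type Parser = ParsecT Void Text IO

-- Placeholder for whatever the DSL really parses: one line of text.
newtype Item = Item Text deriving (Eq, Show)

-- Everything up to (not including) the end of the current line.
restOfLine :: Parser Text
restOfLine = takeWhile1P (Just "rest of line") (\c -> c /= '\n' && c /= '\r')

-- Parse "include <path>", read that file, and parse it with a nested run.
includeP :: Parser [Item]
includeP = do
  _ <- try (string "include" *> hspace1)
  path <- T.unpack <$> restOfLine
  src <- liftIO (TIO.readFile path)
  result <- liftIO (runParserT document path src)
  either (fail . errorBundlePretty) pure result

-- Any other non-blank line is a single item.
itemP :: Parser [Item]
itemP = (\t -> [Item t]) <$> restOfLine

-- A whole file, with includes already expanded in place.
document :: Parser [Item]
document = space *> (concat <$> many ((includeP <|> itemP) <* space)) <* eof
```

Running it looks like runParserT document "main.dsl" source, which yields an IO action returning either a ParseErrorBundle or the fully expanded list of items. Note how file access is now tangled into the grammar, which is the trade-off mentioned above.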