parsingmarkdowngrammarcontext-free-grammar

Can markdown grammar be parsed using a CFG?


Markdown supports nested lists, indicated by the length of whitespace (the indent) at the beginning of each line. This is similar to Python. It seems hard to construct a parser using a context free grammar (CFG) for it. Is the grammar context sensitive and not context free?


Solution

  • There are quite a few threads that claim it is not. A good looking explanation is: https://roopc.net/posts/2014/markdown-cfg/ by Roopesh Chander:

    consider the case of emphasis. *a* is em and **a** is strong. If we wrote a (very simplified) CFG in an EBNF-like format for Markdown text runs (i.e. contents of Markdown paragraphs), it would look like this:

    text-run    = span text-run | span
    span        = strong | em | normal-text
    strong      = "**" text-run "**"
    em          = "*" text-run "*"
    normal-text = [a-zA-Z0-9 ]
    

    This already has ambiguities in that **a** can be interpreted by this grammar in two different ways: Just a strong run, or an em run inside another em run.

    But it gets worse. The grammar we saw above would reject the input *a, which in Markdown should be interpreted as normal text. Same case for a*, **a and a**. If we wanted to handle these too, we need these expansions as well:

    normal-text = "*" text-run | "**" text-run |
                  text-run "*" | text-run "**"
    

    And with that addition, the amount of ambiguity shoots up significantly and unmanageably.

    If we take a closer look at why this happens, we can see that when a * token is encountered, the grammar cannot decide whether it should be part of an em-qualifier or normal text (or sometimes, a strong-qualifier). That decision can be made only after scanning the rest of the input till we find a matching closing *. If there is no matching closing *, we will know that only if we scan until the end of the text-run’s string (which would be the end of the paragraph). So, for the parser to know which rule to choose at a point, it might have to look ahead an arbitrary number of tokens.

    More related threads: