Markdown supports nested lists, indicated by the length of whitespace (the indent) at the beginning of each line. This is similar to Python. It seems hard to construct a parser using a context free grammar (CFG) for it. Is the grammar context sensitive and not context free?
There are quite a few threads that claim it is not. A good looking explanation is: https://roopc.net/posts/2014/markdown-cfg/ by Roopesh Chander:
consider the case of emphasis.
*a*
is em and**a**
is strong. If we wrote a (very simplified) CFG in an EBNF-like format for Markdown text runs (i.e. contents of Markdown paragraphs), it would look like this:text-run = span text-run | span span = strong | em | normal-text strong = "**" text-run "**" em = "*" text-run "*" normal-text = [a-zA-Z0-9 ]
This already has ambiguities in that
**a**
can be interpreted by this grammar in two different ways: Just a strong run, or an em run inside another em run.But it gets worse. The grammar we saw above would reject the input
*a
, which in Markdown should be interpreted as normal text. Same case fora*
,**a
anda**
. If we wanted to handle these too, we need these expansions as well:normal-text = "*" text-run | "**" text-run | text-run "*" | text-run "**"
And with that addition, the amount of ambiguity shoots up significantly and unmanageably.
If we take a closer look at why this happens, we can see that when a
*
token is encountered, the grammar cannot decide whether it should be part of an em-qualifier or normal text (or sometimes, a strong-qualifier). That decision can be made only after scanning the rest of the input till we find a matching closing*
. If there is no matching closing*
, we will know that only if we scan until the end of the text-run’s string (which would be the end of the paragraph). So, for the parser to know which rule to choose at a point, it might have to look ahead an arbitrary number of tokens.
More related threads: