
CommonMark Parsing ***


Let's say I want to parse the string ***cat*** as Markdown and render it to HTML using the CommonMark standard. The standard says (http://spec.commonmark.org/0.28/#phase-2-inline-structure):

....

If one is found:

Figure out whether we have emphasis or strong emphasis: if both closer and opener spans have length >= 2, we have strong, otherwise regular.

Insert an emph or strong emph node accordingly, after the text node corresponding to the opener.

Remove any delimiters between the opener and closer from the delimiter stack.

Remove 1 (for regular emph) or 2 (for strong emph) delimiters from the opening and closing text nodes. If they become empty as a result, remove them and remove the corresponding element of the delimiter stack. If the closing node is removed, reset current_position to the next element in the stack.

....

Based on my reading of this, the result should be <em><strong>cat</strong></em>, since the <strong> is added first, THEN the <em>. However, all online markdown editors I have tried this in output <strong><em>cat</em></strong>. What am I missing?

Here is a visual representation of what I think should be happening

TextNode[***] TextNode[cat] TextNode[***]

TextNode[*] StrongEmphasis TextNode[cat] TextNode[*]

TextNode[] Emphasis StrongEmphasis TextNode[cat] TextNode[]

Emphasis StrongEmphasis TextNode[cat]
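
The same walk-through can be sketched in code. This is only a minimal simulation of the quoted steps, assuming a single opener and closer made of equal-length runs of `*` around plain text; the function name `wrap_emphasis` is hypothetical and not part of any Markdown library, and a real parser has to handle the full delimiter stack.

```python
def wrap_emphasis(text: str, run_length: int) -> str:
    """Simulate the quoted steps for one opener/closer pair of '*' runs.

    Assumption: opener and closer have the same length and nothing else
    sits on the delimiter stack. Hypothetical helper, for illustration only.
    """
    opener = closer = run_length
    inner = text
    while opener > 0 and closer > 0:
        if opener >= 2 and closer >= 2:
            # Both spans have length >= 2: strong emphasis, consume 2 each.
            inner = f"<strong>{inner}</strong>"
            used = 2
        else:
            # Otherwise regular emphasis, consume 1 each.
            inner = f"<em>{inner}</em>"
            used = 1
        opener -= used
        closer -= used
    return inner


print(wrap_emphasis("cat", 3))  # <em><strong>cat</strong></em>
```

Each pass wraps whatever was built so far, so the node inserted first (the strong one) ends up innermost and the later emphasis node ends up outermost.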


Solution

  • It's important to remember that Commonmark and Markdown are not necessarily the same thing. Commonmark is a recent variant of Markdown. Most Markdown parsers existed and established their behavior long before the Commonmark spec was even started.

    While the original Markdown rules make no comment on whether the <em> or <strong> tag should come first in the given example, the actual behavior of the reference implementation (markdown.pl) was to put the <strong> tag before the <em> tag in the output. In fact, the MarkdownTest package (which was created by the author of Markdown and markdown.pl) explicitly required that output. The original is no longer available online that I know of, but mdtest is a faithful copy, and its history shows no modifications of that test since the initial import from MarkdownTest. AFAICT, every (non-Commonmark) Markdown parser has followed that behavior exactly.

    The Commonmark spec took a different route. The spec specifically states in Rule 14 of Section 6.4 (Emphasis and strong emphasis):

    An interpretation <em><strong>...</strong></em> is always preferred to <strong><em>...</em></strong>.

    ... and backs it up with example 444:

    ***foo***
    
    <p><em><strong>foo</strong></em></p>
    

    In fact, that is exactly the behavior of the reference implementation of Commonmark.

    As an aside, the original question quotes from the appendix to the spec, which recommends how to implement a parser. While that section is potentially useful to a parser creator, I would not recommend using it to determine proper syntax handling and/or output. The actual rules should be consulted instead, and in fact they clearly provide the expected output in this instance. But this question is about an apparent disparity between implementations and the spec, not interpretation of the spec.

    For a more complete comparison, see Babelmark. With the exception of a few (completely) broken implementations, every "classic" Markdown parser follows markdown.pl, while every Commonmark parser follows the Commonmark spec. Therefore, there is no actual disparity between the spec and implementations. The disparity is between Markdown and Commonmark.

    Why the Commonmark authors chose a different route in this regard, and why they insist on calling Commonmark "Markdown" when it is clearly different, are questions that are off topic here and better asked of the authors themselves.