
How are tabs interpreted in CommonMark?


See the description before Example 6 in the CommonMark spec at: http://spec.commonmark.org/0.27/#example-5

I am trying to understand how the following code leads to a code-block starting with two spaces.

>→→foo

Example 6 shows that this would translate to the following.

<blockquote>
<pre><code>  foo
</code></pre>
</blockquote>

But Section 2.2 clearly states:

However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.

So as per my understanding, the above Markdown behaves like the following (I denote a space with a dot).

>........foo

Since one optional space is allowed after > and 4 spaces are used to indent a code block, we are left with:

>...foo

That's a code-block starting with three spaces. How does CommonMark claim then that it should lead to a code-block starting with two spaces? What am I missing?


Solution

  • The key is in the very first paragraph of the Tabs section (emphasis added):

    Tabs in lines are not expanded to spaces. However, in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters.

    Notice that it says "4 characters", not "4 spaces".

    If you configure your text editor to use a tab stop of four and to replace tabs with spaces (any good text editor should offer this setting), the editor treats the line as a series of columns that are each four characters wide. Pressing the tab key advances the cursor to the start of the next column. If the current column already contains characters, only enough spaces are added to bring the total to four characters, which is fewer than four spaces.

    For example, if you type an angle bracket (>) character in your editor and then press tab, you will get the following (when configured to replace tabs with spaces):

    >···
    

    The angle bracket plus the tab fills the first column (four characters), so the tab accounts for three spaces. As we are now at the beginning of the next column, pressing tab a second time moves us a full column further (4 more spaces), for a total of 7 spaces:

    >·······
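
The tab-stop arithmetic above can be sketched in a few lines of Python. This is only an illustration of the "behaves as if" rule, not how any CommonMark parser is implemented; the function name `expand_tabs` is made up for this example. Python's built-in `str.expandtabs` follows the same column rule, so it serves as a cross-check.

```python
def expand_tabs(line: str, tab_stop: int = 4) -> str:
    """Replace each tab with the spaces needed to reach the next tab stop."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            # A tab only fills what is left of the current column.
            width = tab_stop - (column % tab_stop)
            out.append(" " * width)
            column += width
        else:
            out.append(ch)
            column += 1
    return "".join(out)


expanded = expand_tabs(">\t\tfoo")
# Show spaces as middle dots, as in the examples above.
print(expanded.replace(" ", "·"))  # >·······foo
# The built-in method applies the same rule.
print(expanded == ">\t\tfoo".expandtabs(4))  # True
```

The first tab contributes three spaces because `>` already occupies one character of the first column; the second tab starts on a tab stop and contributes a full four.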
    

    We can confirm this is the correct interpretation with a more recent change to the spec, committed in 3bc01c5dc (which apparently hasn't made it to a release yet). As the commit comment suggests, the clarification helps the math make more sense (emphasis added):

    Normally the > that begins a block quote may be followed optionally by a space, which is not considered part of the content. In the following case > is followed by a tab, which is treated as if it were expanded into three spaces. Since one of these spaces is considered part of the delimiter, foo is considered to be indented six spaces inside the block quote context, so we get an indented code block starting with two spaces.

    Notice the sentence stating that the tab "is treated as if it were expanded into three spaces", which confirms that the first tab only adds three spaces.

    Therefore, as we have now established, we start with an angle bracket plus seven spaces. First we break off the blockquote delimiter, which consists of the angle bracket and the first space (in the following examples the | indicates where the parser breaks the string and should not be counted as a character):

    >·|······
    

    The text contained in the blockquote is now indented six spaces. Four of them are the code block delimiter:

    >·|····|··
    

    Which leaves two spaces at the start of the code block.
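
The same breakdown can be written as a short Python sketch. It is deliberately simplified: it expands the whole line with the built-in `str.expandtabs` first, which is only safe here because the line starts at column 0 and the content (`foo`) contains no tabs. The function name `blockquote_code_content` is hypothetical and exists only for this illustration.

```python
def blockquote_code_content(line: str) -> str:
    """Return the code-block content of a one-line block quote (simplified)."""
    expanded = line.expandtabs(4)      # ">" followed by 7 spaces, then "foo"
    if not expanded.startswith(">"):
        raise ValueError("not a block quote line")
    rest = expanded[1:]
    if rest.startswith(" "):
        rest = rest[1:]                # optional space, part of the delimiter
    if not rest.startswith(" " * 4):
        raise ValueError("not an indented code block")
    return rest[4:]                    # drop the 4-space code block indent


content = blockquote_code_content(">\t\tfoo")
print(repr(content))  # '  foo'
```

Seven spaces minus one for the delimiter minus four for the code indent leaves the two spaces shown in the spec's output.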

    Of course, as stated back at the beginning (of the section in the spec), the tabs aren't actually replaced with spaces; the parser only behaves as if they were. That can be confusing at times. It may help to configure your text editor to always replace tabs with spaces, which avoids this confusion.
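
To illustrate the difference between "behaving as if" and actually replacing, here is a rough Python sketch of consuming indentation by column while leaving later tabs untouched. The helper `consume_columns` is hypothetical and is not taken from any real CommonMark implementation; it assumes that a tab which is only partly used up as indentation leaves its remaining columns behind as spaces.

```python
from typing import Tuple


def consume_columns(text: str, start_column: int, count: int,
                    tab_stop: int = 4) -> Tuple[str, int]:
    """Consume up to `count` columns of leading whitespace from `text`.

    Returns the remaining text and the column it starts at.
    """
    column = start_column
    target = start_column + count
    i = 0
    while column < target and i < len(text):
        ch = text[i]
        if ch == "\t":
            next_stop = column + tab_stop - (column % tab_stop)
            if next_stop > target:
                # Tab is only partly consumed: keep the leftover as spaces.
                return " " * (next_stop - target) + text[i + 1:], target
            column = next_stop
        elif ch == " ":
            column += 1
        else:
            break
        i += 1
    return text[i:], column


# ">" occupies column 0, so the rest of the line starts at column 1.
rest, col = consume_columns("\t\tfoo", 1, 1)   # optional space after ">"
rest, col = consume_columns(rest, col, 4)      # code block indent
print(repr(rest))  # '  foo'

# Tabs that are not part of the indentation survive unchanged.
print(repr(consume_columns("\tfoo\tbaz", 0, 4)[0]))  # 'foo\tbaz'
```

The second call shows why the spec avoids saying tabs are replaced: only the whitespace that defines block structure is measured in columns, and tabs inside the content pass through as tabs.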