parsing markdown abstract-syntax-tree remarkjs

Remark: How to parse HTML tags and their content in MDAST

I'm trying to parse a GitHub-flavoured markdown file using Unified and Remark-Parse to generate an MDAST. Most of it parses correctly and easily, but I'm having trouble handling the HTML tags and their content in the AST.

In the AST, HTML tags and their contents are represented as siblings, not parent-child. For example <sub>hi</sub> is parsed into

[
  {
    "type": "paragraph",
    "children": [
      {
        "type": "html",
        "value": "<sub>"
      },
      {
        "type": "text",
        "value": "hi"
      },
      {
        "type": "html",
        "value": "</sub>",
      }
    ]
  }
]

Ideally, I would want it to be parsed like

[
  {
    "type": "paragraph",
    "children": [
      {
        "type": "html",
        "value": "sub",
        "children": [
          {
            "type": "text",
            "value": "hi"
          }
        ]
      }
    ]
  }
]

so that I can access the tag type and its content. (Specifically, my goal is just to skip over the tags and their content, as they are not needed for my purposes.)

This is the configuration I am using currently:

import unified from 'unified';
import markdown from 'remark-parse';
import type {Block} from '@notionhq/client/build/src/api-types';
import {parseRoot} from './internal';
import gfm from 'remark-gfm';

export function parseBody(body: string): Block[] {
  const tokens = unified().use(markdown).use(gfm).parse(body);
  return parseRoot(tokens);
}

So, my question is: Is there a way of configuring Remark to do this, or is there a Remark plugin for it? If not, how would I go about creating such a plugin?

Thanks.

Solution

first: why the AST looks as it does and why Remark most likely does not have an option to do it differently

The AST represents it that way because that is what the CommonMark specification prescribes for raw inline HTML and for HTML blocks. Specifically, CommonMark specifies that HTML tags are passed through, not parsed.

For inline HTML, the spec supports inline HTML tags, which is not the same as supporting inline HTML. Tags are simply passed through as-is. There is no matching of opening and closing tags. The reasons for this are:

performance
parser complexity
HTML tags are only supported as a "use at your own risk" "last resort" option when Markdown doesn't have a feature you need.

For a small number of HTML tags, matching of opening and closing tags is supported at the block level: pre, script, style, and textarea, the last of which was added only recently, in v0.30 of the spec.

You can read the above linked parts of the spec, and search the discussions in the CommonMark forum to get more understanding of the whys, but to get right to the point, read:

This explanation within the spec for the choices made.
Skip to the Raw HTML section of this forum post (https://talk.commonmark.org/t/beyond-markdown/2787?u=vas) by the CommonMark spec author and maintainer, John MacFarlane (@jgm).
This forum question and also this one and @jgm's answers.

second: what you can do about it

Remark is "part of the unified collective", which is an infrastructure centered around the processing of ASTs (abstract syntax trees). From your question, it sounds like you already get this.

There is lots of help on unified's pages on how to write plugins.

But the best way to both learn how to do this and to get a quick jump on an implementation is to look at the many existing mdast-specific manipulators.
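
Since your stated goal is only to skip the tags and whatever sits between them, here is a minimal sketch of what such a manipulation could look like. It is not an existing Remark option or plugin: `MdNode`, `dropHtmlSpans`, and the two regular expressions are names I made up for illustration, and the tag matching is deliberately naive (it assumes each inline `html` node holds exactly one tag, which is how the AST in your question looks).

```typescript
// Minimal structural type for the nodes involved (an assumption for this sketch;
// a real project would more likely use the types from the `mdast` package).
interface MdNode {
  type: string;
  value?: string;
  children?: MdNode[];
}

// Opening tag such as <sub> or <span class="x">; captures the tag name.
const OPEN_TAG = /^<([A-Za-z][A-Za-z0-9-]*)(?:\s[^>]*)?>$/;
// Closing tag such as </sub>; captures the tag name.
const CLOSE_TAG = /^<\/([A-Za-z][A-Za-z0-9-]*)\s*>$/;
// Elements that never have a closing tag, so they must not open a span.
const VOID_TAGS = new Set(['br', 'hr', 'img', 'input', 'wbr']);

// Hypothetical helper: returns a copy of `children` with every html node removed,
// together with all siblings between an opening tag and its matching closing tag.
function dropHtmlSpans(children: MdNode[]): MdNode[] {
  const kept: MdNode[] = [];
  const openTags: string[] = []; // stack of tag names that are currently open

  for (const node of children) {
    if (node.type === 'html') {
      const value = (node.value ?? '').trim();
      const open = OPEN_TAG.exec(value);
      const close = CLOSE_TAG.exec(value);
      if (close) {
        // Pop back to the matching opener; ignore stray closing tags.
        const idx = openTags.lastIndexOf((close[1] ?? '').toLowerCase());
        if (idx !== -1) openTags.length = idx;
      } else if (open && !value.endsWith('/>')) {
        const name = (open[1] ?? '').toLowerCase();
        if (!VOID_TAGS.has(name)) openTags.push(name);
      }
      continue; // the html node itself is always dropped
    }
    if (openTags.length > 0) continue; // sibling sits inside an open tag: skip it
    // Recurse so that paragraphs, list items, emphasis, etc. are cleaned too.
    kept.push(node.children ? { ...node, children: dropHtmlSpans(node.children) } : node);
  }
  return kept;
}
```

With the code from the question, this could be applied to the `children` of the root returned by `.parse(body)` before the tree is handed to `parseRoot`. The open-tag stack is kept per list of siblings, so an unclosed tag only swallows the rest of its own paragraph rather than the rest of the document.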