Regex for Markdown Table Syntax?

I'm currently developing a tool that allows me to parse GitHub wikis; I'm trying to add support for Markdown tables, which are not supported by the parser I'm using.

I'm a bit stuck with the complicated table syntax. The official specification is here:

| Left align | Right align | Center align |
|:-----------|------------:|:------------:|
| This       |        This |     This     |
| column     |      column |    column    |
| will       |        will |     will     |
| be         |          be |      be      |
| left       |       right |    center    |
| aligned    |     aligned |   aligned    |

As you can see there's some structure but some parts are entirely optional.

I would like a regex that captures the header (first line), the column alignment data (second line), and the actual content as separate groups. It should only match if there is at least one content line. The header and alignment data also have to obey certain rules, as seen in the example.

It's possible my approach is misguided (perhaps regex can be avoided?). If so, any answer that reaches the same result more simply is appreciated.

Solution

I wasn't fully satisfied with some of the other answers here because they capture whitespace before or after the table, or don't handle cases such as multiple tables separated by a single newline. Building on those answers, here's what I came up with:

/((?:\| *[^|\r\n]+ *)+\|)(?:\r?\n)((?:\|[ :]?-+[ :]?)+\|)((?:(?:\r?\n)(?:\| *[^|\r\n]+ *)+\|)+)/g

This yields three useful capture groups:

header row ((?:\| *[^|\r\n]+ *)+\|)
alignment row ((?:\|[ :]?-+[ :]?)+\|)
body ((?:(?:\r?\n)(?:\| *[^|\r\n]+ *)+\|)+)
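
To make the three groups concrete, here is a minimal sketch that runs the same pattern (without the g flag, since only one match is needed) against a shortened version of the table from the question. The names tablePattern, sample, headerRow, alignmentRow and bodyRows are illustrative and not part of the plugin. Note that the body group starts with the line break that separates it from the alignment row, so it usually needs a trim before splitting.

```typescript
// Same pattern as above, minus the g flag (single match only).
const tablePattern =
  /((?:\| *[^|\r\n]+ *)+\|)(?:\r?\n)((?:\|[ :]?-+[ :]?)+\|)((?:(?:\r?\n)(?:\| *[^|\r\n]+ *)+\|)+)/;

// Shortened version of the table from the question (illustrative input).
const sample = [
  "| Left align | Right align | Center align |",
  "|:-----------|------------:|:------------:|",
  "| left       |       right |    center    |",
].join("\n");

const groups = tablePattern.exec(sample);

// Group 1: header row, group 2: alignment row, group 3: all body rows.
const headerRow = groups ? groups[1] ?? "" : "";
const alignmentRow = groups ? groups[2] ?? "" : "";
// Group 3 begins with a line break, hence the trim.
const bodyRows = groups ? (groups[3] ?? "").trim() : "";

console.log({ headerRow, alignmentRow, bodyRows });
```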

I did this for a custom remark plugin. Below is an example TypeScript implementation.

const input = `
### Table 1: Fruit Information

| Fruit     | Color    | Taste  | Seasonal Availability |
|-----------|----------|--------|-----------------------|
| Apple     | Red      | Sweet  | Fall                  |
| Banana    | Yellow   | Sweet  | All Year              |
| Orange    | Orange   | Citrus | Winter                |
| Strawberry| Red      | Sweet  | Spring                |
| Grape     | Purple   | Sweet/Tart | Fall              |

### Table 2: Countries and Capitals

| Country    | Capital       | Population (millions) |
|------------|---------------|-----------------------|
| USA        | Washington D.C.| 331                   |
| Canada     | Ottawa        | 38                    |
| Germany    | Berlin        | 83                    |
| Japan      | Tokyo         | 126                   |
| Australia  | Canberra      | 25                    |

some text after
`

function splitRow(row: string) {
  return row
    .trim()
    .replace(/^\||\|$/g, "") // Remove leading and trailing pipes
    .split("|")
    .map((cell) => cell.trim());
}

const tableRegex =
        /((?:\| *[^|\r\n]+ *)+\|)(?:\r?\n)((?:\|[ :]?-+[ :]?)+\|)((?:(?:\r?\n)(?:\| *[^|\r\n]+ *)+\|)+)/g;
let match: RegExpExecArray | null;

while ((match = tableRegex.exec(input)) !== null) {
  const fullTableString = match[0];
  const headerGroup = match[1];
  const separatorGroup = match[2];
  const bodyGroup = match[3];

  if (!fullTableString || !headerGroup || !separatorGroup || !bodyGroup) {
    console.error("Markdown table regex failed to yield table groups");
    break;
  }

  const headerCells = splitRow(headerGroup);
  const alignments = splitRow(separatorGroup).map((cell) => {
    if (cell.startsWith(":") && cell.endsWith(":")) return "center";
    if (cell.endsWith(":")) return "right";
    if (cell.startsWith(":")) return "left";
    return null;
  });

  const bodyRows = bodyGroup
    .trim()
    .split("\n")
    .map((bodyRow) => splitRow(bodyRow.trim()));

  console.log("TABLE FOUND", fullTableString, headerCells, alignments, bodyRows)
}

Test Regex on regex101.com

Run on TS Playground
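
On the question of whether regex can be avoided for the overall structure: a line-by-line scan is one alternative. The following is only a minimal sketch under stated assumptions, not the remark plugin's code. The names ParsedTable, isRow, isSeparator, cells, toAlignment and parseTables are all hypothetical, and the sketch is more lenient than the pattern above (it accepts empty cells and indented rows). Small regexes are still used to classify individual lines.

```typescript
// Hypothetical types and helpers, introduced only for this sketch.
type Alignment = "left" | "right" | "center" | null;

interface ParsedTable {
  header: string[];
  alignments: Alignment[];
  rows: string[][];
}

// A row starts and ends with a pipe (assumption: no pipe-less table rows).
const isRow = (line: string): boolean => /^\|.*\|$/.test(line.trim());

// A separator row contains only dashes, optional colons and spaces per cell.
const isSeparator = (line: string): boolean =>
  /^\|(?: *:?-+:? *\|)+$/.test(line.trim());

// Strip the outer pipes, split on the inner ones, and trim each cell.
function cells(line: string): string[] {
  return line.trim().slice(1, -1).split("|").map((c) => c.trim());
}

function toAlignment(cell: string): Alignment {
  const left = cell.startsWith(":");
  const right = cell.endsWith(":");
  if (left && right) return "center";
  if (right) return "right";
  if (left) return "left";
  return null;
}

function parseTables(markdown: string): ParsedTable[] {
  const lines = markdown.split(/\r?\n/);
  const tables: ParsedTable[] = [];
  let i = 0;
  while (i < lines.length) {
    const header = lines[i] ?? "";
    const separator = lines[i + 1] ?? "";
    const firstBody = lines[i + 2] ?? "";
    // Require header, separator and at least one content line.
    if (!(isRow(header) && isSeparator(separator) && isRow(firstBody))) {
      i += 1;
      continue;
    }
    // Collect consecutive body rows until the first non-row line.
    const rows: string[][] = [];
    let j = i + 2;
    while (j < lines.length && isRow(lines[j] ?? "")) {
      rows.push(cells(lines[j] ?? ""));
      j += 1;
    }
    tables.push({
      header: cells(header),
      alignments: cells(separator).map(toAlignment),
      rows,
    });
    i = j;
  }
  return tables;
}
```

Calling parseTables(input) on the sample input above should return one entry per table, each with its header cells, alignments and body rows already split.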