I'm trying to make a custom syntax highlighter for my own markup language. All the examples I've found are complicated, skip steps, and are very hard to understand. For example, this video has an extremely large skip in the middle and doesn't really explain much.
Is there anything that fully documents how to make a syntax highlighter for VS Code?
My current code, made with the Yeoman generator, is:
{
"$schema": "https://raw.githubusercontent.com/martinring/tmlanguage/master/tmlanguage.json",
"name": "BetterMarkupLanguage",
"patterns": [
{
"include": "#keywords"
},
{
"include": "#strings"
}
],
"repository": {
"keywords": {
"patterns": [{
"name": "entity.other.bml",
"match": "\\b({|}|\\\\|//)\\b"
}]
},
"strings": {
"name": "string.quoted.double.bml",
"begin": "`",
"end": "`"
}
},
"scopeName": "source.bml"
}
VS Code uses a tmLanguage engine, which means you can write a syntax highlighter at one of two complexities:
If you only want to do the former, that's still better than nothing at all, but I'm going to cover both types in this answer. The two levels of complexity correspond directly to the two basic jobs for a tmLanguage engine. These are:
After scopes have been assigned by your syntax highlighter, the user's Color Theme maps them to colors and styles to apply to the text. (Using established conventions for your scopes helps preserve consistency for colors and styles across multiple languages that a user may have installed.)
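To see that mapping in action without writing a theme, you can override a scope's color in your own settings.json. This is only a sketch: the scope is the one from the question's grammar, and the color and font style are arbitrary values picked for illustration.

```json
{
  "editor.tokenColorCustomizations": {
    "textMateRules": [
      {
        "name": "Illustrative only: arbitrary styling for BML strings",
        "scope": "string.quoted.double.bml",
        "settings": {
          "foreground": "#CE9178",
          "fontStyle": "italic"
        }
      }
    ]
  }
}
```

In practice you rarely need this, because most themes already style conventional prefixes such as string.quoted. It is mainly useful for checking that your grammar assigned the scope you expected.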
Let's say you make a definition for integers with this snippet:
"integers": {
"patterns": [{
"name": "constant.numeric.integer.bml",
"match": "[+-]\\d+"
}]
},
When the engine encounters text matching the regex pattern in "match", it will select it, assign the scope from "name", and then continue looking for text that matches any regex in the combined list of patterns for the context at the tip of the stack.
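Keep in mind that a repository entry is never tried unless some list of patterns includes it. A minimal sketch, assuming the "integers" entry has been added to the repository of the question's grammar, would extend the top-level "patterns" like this:

```json
{
  "patterns": [
    { "include": "#keywords" },
    { "include": "#strings" },
    {
      "comment": "Assumed addition: makes the integers entry active at the top level",
      "include": "#integers"
    }
  ]
}
```

The "comment" key is ignored by the engine and is only there to mark the line that was added.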
Compare the integer definition to your "strings" one:
"strings": {
"name": "string.quoted.backtick.bml",
// The "string.quoted.double.bml" scope used by the question is for strings
// bracketed by double-quotes.
"begin": "`",
"end": "`",
// Ideally, you also use the begin/end captures to add "punctuation" scopes
// to the backticks themselves. See the further reference links at the end.
"beginCaptures": {
"0": "punctuation.definition.string.begin.backtick.bml"
},
"endCaptures": {
"0": "punctuation.definition.string.end.backtick.bml"
}
},
Those "begin"
and "end"
markers denote a change in the tmLanguage
stack. You have pushed into a new context inside of a string. Right now, there are no patterns configured to match within this context, but you could do that by adding a "patterns"
key.
Think about escaped backticks: you want to scope those as constant.character.escape.bml and stay in the same context within "strings". You don't want an escaped backtick to leave the string context prematurely. Here's an example that assumes double-backticks are escaped:
"strings": {
"name": "string.quoted.backtick.bml",
"begin": "`",
"end": "`",
"patterns": [{
"name": "constant.character.escape.bml",
"match": "``"
}]
},
Obviously, if your language uses \ to escape characters, modify the "match" correspondingly. (And don't forget that the JSON needs to escape the \s, too!)
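As a rough sketch of that variant, assuming your language treats a backslash followed by any single character as an escape (that rule is an assumption, not something taken from the question):

```json
{
  "strings": {
    "name": "string.quoted.backtick.bml",
    "begin": "`",
    "end": "`",
    "patterns": [
      {
        "comment": "Assumption: a backslash escapes the next character. The regex is backslash-backslash-dot, doubled again for JSON.",
        "name": "constant.character.escape.bml",
        "match": "\\\\."
      }
    ]
  }
}
```

Because the escape rule consumes the backslash together with the character after it, an escaped backtick is already used up by the time the engine looks for "end" again, so the string stays open.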
You'll eventually notice that the first pattern encountered is matched. Remember the integers set? What happens when you have 45.125? It will decide to match the 45 and the 125 as integers and ignore the . entirely. If you have a "floats" pattern, you want to include that before your naïve integer pattern.
"numbers": {
"patterns": [{
"name": "constant.numeric.float.bml",
"match": "[+-]\\d+\\.\\d*"
}, {
"name": "constant.numeric.integer.bml",
"match": "[+-]\\d+"
}]
},
There's another way to write the "numbers" snippet above. A pattern can include other lists of patterns, even interspersed with its own matches to preserve the order of regex comparison.
This allows you to name and re-use floats and integers in other parts of your highlighter. Notice that the "include" for floats comes first, so they will match first inside "numbers" even though they are defined after "integers" in the file.
"numbers": {
"patterns": [
{"include": "#floats"},
{"include": "#integers"}
]
},
"integers": {
"patterns": [{
"name": "constant.numeric.integer.bml",
"match": "[+-]\\d+"
}]
},
"floats": {
"patterns": [{
"name": "constant.numeric.float.bml",
"match": "[+-]\\d+\\.\\d*"
}]
},
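Re-use pays off most once you have more than one context. The following is a hypothetical sketch: it assumes your language groups content in { } pairs, and the "blocks" name and meta.block.bml scope are invented for illustration. It pulls "numbers" and "strings" into the new context and includes itself so that nested braces push a further level onto the stack.

```json
{
  "blocks": {
    "comment": "Hypothetical rule: assumes BML groups content in { } pairs",
    "name": "meta.block.bml",
    "begin": "\\{",
    "end": "\\}",
    "patterns": [
      { "include": "#numbers" },
      { "include": "#strings" },
      { "include": "#blocks" }
    ]
  }
}
```

Anything not listed in "patterns" simply isn't matched inside a block, which is how a context narrows or widens what can be highlighted there.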
The numbers/integers/floats example was trivial, but well-designed syntax definitions will define utility groups that "include" equivalent things together for re-usability:
A normal programming language will have things like return and so on.
A markup language like yours might have
Though there is more you could learn (capture groups, injections, scope conventions, etc.), this is hopefully a practical overview for getting started.
When you write your syntax highlighting, think to yourself: Does matching this token put me in a place where things like it can be matched again? Or does it put me in a different place where different things (more or fewer) ought to be matched? If the latter, what returns me to the original set of matches?