I'm using the lexer from Pygments, a Python library. I want to get tokens for C++ code, and in particular to detect when a new variable is declared, e.g.
int a=3,b=5,c=4;
Here a,b,c should be given the type "Declared variables", which is different from
a=3,b=5,c=4;
Here a,b,c should simply be given type "Variables", since they have been declared before.
I'd like to use the lexer's ability to scan multiple tokens at once (see the Pygments documentation). I want to write a regex along the lines of
(int)(\s)(?:([a-z]+)(=)([0-9]+)(,))*, bygroups(Type,Space,Name,Equal,Number,Comma)
(The "?:" is just to tell Pygments that this grouping shouldn't be used in the bygroups.)
However, instead of matching any number of declarations in the line, it only returns tokens for the last declaration in the line (in this case, the "c=4" portion). How can I make it return the tokens for all declarations in the line?
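What you need is a stateful lexer. Your regexp won't work because bygroups needs each group to cover one contiguous span of the match, and a capture group inside a repetition only keeps the text of its last iteration.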
int a=3,b=5,c=4;
Here you want chars 0..2 to be Type, char 3 to be Space, chars 4..7 to be Name, Equal, Number and Comma, and then Name, Equal, Number and Comma again for every further declaration. A fixed list of groups can't express that.
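To see the limitation outside of Pygments, here is a small sketch using only Python's re module. The pattern is the one from the question with one assumed change: the comma group is made optional, written (,?), so that the final "c=4" can match without a trailing comma.

```python
import re

# Pattern from the question, with the comma made optional (an assumption
# made here so that the last declaration, which has no comma, still matches).
pattern = re.compile(r'(int)(\s)(?:([a-z]+)(=)([0-9]+)(,?))*')

m = pattern.match('int a=3,b=5,c=4;')

# The whole match covers every declaration...
print(m.group(0))   # int a=3,b=5,c=4

# ...but each repeated group only remembers its last iteration.
print(m.groups())   # ('int', ' ', 'c', '=', '4', '')
print(m.span(3))    # (12, 13)
```

Since bygroups only receives what the groups captured, the spans for a=3 and b=5 are simply not available to it.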
The solution is to remember when a type declaration has been seen and enter a new lexer state that lasts until the next semicolon. See "Changing states" in the Pygments documentation.
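As a minimal sketch of that idea, the lexer below handles only the two lines from the question. TinyDeclLexer and the NameDecl token type are names made up for this illustration, and the rules are deliberately too simple for real C++; the point is only the jump into a 'vardecl' state and the '#pop' on the semicolon.

```python
from pygments.lexer import RegexLexer
from pygments.token import (Keyword, Name, Number, Operator, Punctuation,
                            Text, Token)

# Hypothetical custom token type for freshly declared variables.
NameDecl = Token.NameDecl


class TinyDeclLexer(RegexLexer):
    """Toy lexer: only understands 'int' declarations and assignments."""
    name = 'TinyDecl'

    tokens = {
        'root': [
            (r'\s+', Text),
            # Seeing a type switches to the declaration state.
            (r'int\b', Keyword.Type, 'vardecl'),
            (r'[a-z]+', Name),
            (r'=', Operator),
            (r'[0-9]+', Number),
            (r'[,;]', Punctuation),
        ],
        'vardecl': [
            (r'\s+', Text),
            # Same regex as in 'root', but a different token type.
            (r'[a-z]+', NameDecl),
            (r'=', Operator),
            (r'[0-9]+', Number),
            (r',', Punctuation),
            # The semicolon ends the declaration: back to 'root'.
            (r';', Punctuation, '#pop'),
        ],
    }


code = 'int a=3,b=5,c=4;\na=3,b=5,c=4;\n'
tokens = list(TinyDeclLexer().get_tokens(code))
declared = [value for ttype, value in tokens if ttype is NameDecl]
used = [value for ttype, value in tokens if ttype is Name]
print('declared:', declared)
print('used:', used)
```

This toy version treats every name after the type as a declaration, so "int a=b;" would mark b as declared too. Telling the declared name apart from names in its initializer needs a further state.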
Below is a solution that uses CFamilyLexer and adds three new lexer states. So when it sees a line like this while in the function state:
int m = 3 * a + b, x = /* comments ; everywhere */ a * a;
First it consumes:
int
It matches the new rule I added, so it enters the vardecl state:
m
Oh, the name of a variable! Since the lexer is in the vardecl state, this is a newly defined variable. Emit it as a NameDecl token. Then enter the varvalue state.
3
Just a number.
*
Just an operator.
a
Oh, the name of a variable! But now we are in the varvalue state, so it is not a variable declaration, just a regular variable reference.
+ b
An operator and another variable reference.
,
Value of variable m fully declared. Go back to the vardecl state.
x =
New variable declaration.
/* comments ; everywhere */
The whole comment is consumed as a single Comment.Multiline token, so characters inside it that would otherwise have significance, such as ;, are ignored.
a * a
Value of variable x.
;
Return to the function state. The special variable declaration rules are done.
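The walk-through above can be reproduced with a small stand-alone sketch that keeps an explicit state stack. It uses plain re rather than Pygments, and RULES and trace are hypothetical names with a heavily reduced rule set, so treat it as an illustration of how '#pop' and '#pop:2' move through the stack, not as the real lexer.

```python
import re

# Reduced rule table: state -> list of (regex, label, action).
# action is None, a state name to push, '#pop' or '#pop:2'.
RULES = {
    'function': [
        (r'\s+', 'Text', None),
        (r'int\b', 'Keyword.Type', 'vardecl'),
        (r'[a-zA-Z_]\w*', 'Name', None),
        (r'\d+', 'Number', None),
        (r'[=*+,;]', 'Punctuation', None),
    ],
    'vardecl': [
        (r'\s+', 'Text', None),
        (r';', 'Punctuation', '#pop'),
        (r'[a-zA-Z_]\w*', 'NameDecl', 'varvalue'),
    ],
    'varvalue': [
        (r'\s+', 'Text', None),
        (r',', 'Punctuation', '#pop'),      # back to 'vardecl'
        (r';', 'Punctuation', '#pop:2'),    # back to 'function'
        (r'/\*.*?\*/', 'Comment', None),
        (r'[=*+]', 'Operator', None),
        (r'\d+', 'Number', None),
        (r'[a-zA-Z_]\w*', 'Name', None),
    ],
}


def trace(text, start='function'):
    """Return (tokens, final_stack); tokens are (label, text, state)."""
    stack, pos, tokens = [start], 0, []
    while pos < len(text):
        for pattern, label, action in RULES[stack[-1]]:
            m = re.compile(pattern, re.S).match(text, pos)
            if m is None:
                continue
            if label != 'Text':
                # Record the state the token was matched in.
                tokens.append((label, m.group(), stack[-1]))
            if action == '#pop':
                stack.pop()
            elif action == '#pop:2':
                del stack[-2:]
            elif action is not None:
                stack.append(action)
            pos = m.end()
            break
        else:
            raise ValueError('no rule matches at position %d' % pos)
    return tokens, stack


line = 'int m = 3 * a + b, x = /* comments ; everywhere */ a * a;'
tokens, stack = trace(line)
for label, text, state in tokens:
    print('%-9s %-13s %r' % (state, label, text))
```

Only m and x come out as NameDecl, the semicolon inside the comment never reaches the ';' rule, and the stack is back to just 'function' at the end. The actual Pygments lexer follows.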
from pygments import highlight
from pygments.formatters import HtmlFormatter, TerminalFormatter
from pygments.formatters.terminal import TERMINAL_COLORS
from pygments.lexer import inherit
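# In current Pygments releases CFamilyLexer lives in pygments.lexers.c_cpp.
from pygments.lexers.c_cpp import CFamilyLexer
from pygments.token import (Comment, Keyword, Name, Number, Operator,
                            Punctuation, String, Text, Token, STANDARD_TYPES)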
# New token type for variable declarations. Red makes them stand out
# on the console.
NameDecl = Token.NameDecl
STANDARD_TYPES[NameDecl] = 'ndec'
TERMINAL_COLORS[NameDecl] = ('red', 'red')
class CDeclLexer(CFamilyLexer):
    tokens = {
        # Only touch variables declared inside functions.
        'function': [
            # The obvious fault that is hard to get around is that
            # user-defined types won't be caught by this regexp.
            (r'(?<=\s)(bool|int|long|float|short|double|char|unsigned|signed|void|'
             r'[a-z_][a-z0-9_]*_t)\b',
             Keyword.Type, 'vardecl'),
            inherit
        ],
        'vardecl': [
            (r'\s+', Text),
            # Comments
            (r'/(\\\n)?[*](.|\n)*?[*](\\\n)?/', Comment.Multiline),
            (r';', Punctuation, '#pop'),
            (r'[~!%^&*+=|?:<>/-]', Operator),
            # After the name of the variable has been tokenized enter
            # a new mode for the value.
            (r'[a-zA-Z_][a-zA-Z0-9_]*', NameDecl, 'varvalue'),
        ],
        'varvalue': [
            (r'\s+', Text),
            (r',', Punctuation, '#pop'),
            (r';', Punctuation, '#pop:2'),
            # Comments
            (r'/(\\\n)?[*](.|\n)*?[*](\\\n)?/', Comment.Multiline),
            # '-' goes last in the class so it is a literal; written as
            # '/-\[' it would form a range that also swallows digits and
            # upper-case letters.
            (r'[~!%^&*+=|?:<>/\[\]-]', Operator),
            (r'\d+[LlUu]*', Number.Integer),
            # Rules for strings and chars.
            (r'L?"', String, 'string'),
            (r"L?'(\\.|\\[0-7]{1,3}|\\x[a-fA-F0-9]{1,2}|[^\\\'\n])'", String.Char),
            (r'[a-zA-Z_][a-zA-Z0-9_]*', Name),
            # Getting arrays right is tricky.
            (r'\{', Punctuation, 'arrvalue'),
        ],
        'arrvalue': [
            (r'\s+', Text),
            (r'\d+[LlUu]*', Number.Integer),
            (r'\}', Punctuation, '#pop'),
            (r'[~!%^&*+=|?:<>/\[\]-]', Operator),
            (r',', Punctuation),
            (r'[a-zA-Z_][a-zA-Z0-9_]*', Name),
            (r'\{', Punctuation, '#push'),
        ]
    }
code = '''
#include <stdio.h>
void main(int argc, char *argv[])
{
    int vec_a, vec_b;
    int a = 3, /* Mo;yo */ b=5, c=7;
    int m = 3 * a + b, x = /* comments everywhere */ a * a;
    char *myst = "hi;there";
    char semi = ';';
    time_t now = /* Null; */ NULL;
    int arr[10] = {1, 2, 9 / c};
    int foo[][2] = {{1, 2}};
    a = b * 9;
    c = 77;
    d = (int) 99;
}
'''
for formatter in [TerminalFormatter, HtmlFormatter]:
    print(highlight(code, CDeclLexer(), formatter()))