c++ lexer pygments

Pygments lexer multiple tokens


I'm using the lexer from Pygments, a Python library. I want to tokenize C++ code and, in particular, detect when a new variable is declared, e.g.

int a=3,b=5,c=4;

Here a, b and c should be given the token type "Declared variables", which is different from

a=3,b=5,c=4;

Here a, b and c should simply be given the type "Variables", since they have been declared before.

I'd like to use the lexer's ability to scan multiple tokens at once (see the Pygments documentation). I want to write a regex along the lines of

(int)(\s)(?:([a-z]+)(=)([0-9]+)(,))*, bygroups(Type,Space,Name,Equal,Number,Comma)

(The "?:" makes the outer group non-capturing, so it is not counted in bygroups.)

However, instead of matching every declaration in the line, it only returns tokens for the last declaration in the line (in this case, the "c=4" portion). How can I make it return the tokens for all declarations in the line?


Solution

  • What you need is a stateful lexer. Your regexp won't work because a repeated capture group keeps only its last match, while bygroups needs each group to cover one contiguous span of text.

    int a=3,b=5,c=4;
    

    Here you want characters 0..2 to be Type, 3 to be Space, and 4..7 to be Name, Equal, Number and Comma, then Name, Equal, Number and Comma again for 8..11. A single group cannot cover several separate spans, so that's no good.

    The solution is to remember when a type declaration has been seen and enter a new lexer state that lasts until the next semicolon. See "Changing states" in the Pygments documentation.
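To see why the single-regex approach loses tokens, here is a small check with Python's re module, which Pygments uses for its rules. It is only an illustration: the pattern is the one from the question, and the variable names are made up for the demo. Note that with the pattern exactly as written, the final "c=4" is not part of the match at all because it has no trailing comma, so the last repetition that survives is "b=5,".

```python
import re

# Pattern from the question; groups 3-6 sit inside the repeated (?:...)* group.
pattern = re.compile(r'(int)(\s)(?:([a-z]+)(=)([0-9]+)(,))*')
match = pattern.match('int a=3,b=5,c=4;')

# Each capture group keeps only the text from its last successful repetition,
# so "a=3," has been overwritten by "b=5,".
groups = match.groups()
print(groups)

# A group has exactly one span, so it cannot describe both "a" and "b".
name_span = match.span(3)
print(name_span)
```

Since bygroups works from these group spans, it has nothing to emit for the earlier repetitions.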

    Below is a solution that uses CFamilyLexer and adds three new lexer states. Suppose it sees a line like this while in the function state:

    int m = 3 * a + b, x = /* comments ; everywhere */ a * a;
    

    First it consumes:

    int
    

    It matches the new rule I added, so it enters the vardecl state:

    m
    

    Oh a name of a variable! Since the lexer is in the vardecl state, this is a newly defined variable. Emit it as a NameDecl token. Then enter the varvalue state.

    3
    

    Just a number.

    *
    

    Just an operator.

    a
    

    Oh a name of a variable! But now we are in the varvalue state so it is not a variable declaration, just a regular variable reference.

    + b
    

    An operator and another variable reference.

    ,
    

    Value of variable m fully declared. Go back to the vardecl state.

    x =
    

    New variable declaration.

    /* comments ; everywhere */
    

    The comment rule consumes the whole comment as a single Comment.Multiline token, so characters inside it that would otherwise be significant, such as ;, are ignored.

    a * a
    

    Value of x variable.

    ;
    

    Return to the function state. The special variable declaration rules are done.
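The walkthrough above can be reproduced with a dependency-free sketch of the same state-stack idea. This is not Pygments itself: RULES, tokenize and the token names are hypothetical, the rule set is heavily simplified, and 'root' stands in for the function state. Only the '#pop' and '#pop:2' action names are borrowed from Pygments.

```python
import re

# Hypothetical rule table: state -> [(regex, token name, action)].
# action is None, a state name to push, or '#pop' / '#pop:N'.
RULES = {
    'root': [
        (r'\s+', 'Text', None),
        (r'\b(?:int|char|float|double)\b', 'Type', 'vardecl'),
        (r'[A-Za-z_]\w*', 'Name', None),
        (r'\d+', 'Number', None),
        (r'[=+*/-]', 'Operator', None),
        (r'[;,]', 'Punctuation', None),
    ],
    'vardecl': [
        (r'\s+', 'Text', None),
        (r';', 'Punctuation', '#pop'),
        # A name seen here is a new declaration; its value comes next.
        (r'[A-Za-z_]\w*', 'NameDecl', 'varvalue'),
    ],
    'varvalue': [
        (r'\s+', 'Text', None),
        (r',', 'Punctuation', '#pop'),    # back to vardecl
        (r';', 'Punctuation', '#pop:2'),  # back to root
        # Listed before the operator rule so '/' does not match first.
        (r'/\*.*?\*/', 'Comment', None),
        (r'\d+', 'Number', None),
        (r'[=+*/-]', 'Operator', None),
        (r'[A-Za-z_]\w*', 'Name', None),
    ],
}


def tokenize(code):
    """Return (token name, text) pairs, tracking states on a stack."""
    stack = ['root']
    pos = 0
    tokens = []
    while pos < len(code):
        # Try the rules of the current (topmost) state in order.
        for pattern, kind, action in RULES[stack[-1]]:
            m = re.compile(pattern, re.S).match(code, pos)
            if not m:
                continue
            tokens.append((kind, m.group()))
            pos = m.end()
            if action is None:
                pass
            elif action.startswith('#pop'):
                count = int(action[5:]) if ':' in action else 1
                del stack[-count:]
            else:
                stack.append(action)
            break
        else:
            raise ValueError('no rule matches at position %d' % pos)
    return tokens


tokens = tokenize('int m = 3 * a + b, x = /* c ; d */ a * a; m = 1;')
decls = [text for kind, text in tokens if kind == 'NameDecl']
names = [text for kind, text in tokens if kind == 'Name']
print(decls)
print(names)
```

The same identifier rule yields NameDecl or Name depending only on which state is on top of the stack, which is the design the full Pygments lexer relies on.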

    from pygments import highlight
    from pygments.formatters import HtmlFormatter, TerminalFormatter
    from pygments.formatters.terminal import TERMINAL_COLORS
    from pygments.lexer import inherit
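    # In current Pygments releases CFamilyLexer lives in pygments.lexers.c_cpp;
    # older releases exposed it from pygments.lexers.compiled.
    from pygments.lexers.c_cpp import CFamilyLexer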
    from pygments.token import *
    
    # New token type for variable declarations. Red makes them stand out
    # on the console.
    NameDecl = Token.NameDecl
    STANDARD_TYPES[NameDecl] = 'ndec'
    TERMINAL_COLORS[NameDecl] = ('red', 'red')
    
    class CDeclLexer(CFamilyLexer):
        tokens = {
            # Only touch variables declared inside functions.
            'function': [
                # The obvious fault that is hard to get around is that
                # user-defined types won't be caught by this regexp.
                (r'(?<=\s)(bool|int|long|float|short|double|char|unsigned|signed|void|'
                 r'[a-z_][a-z0-9_]*_t)\b',
                 Keyword.Type, 'vardecl'),
                inherit
            ],
            'vardecl' : [
                (r'\s+', Text),
                # Comments
                (r'/(\\\n)?[*](.|\n)*?[*](\\\n)?/', Comment.Multiline),
                (r';', Punctuation, '#pop'),
                (r'[~!%^&*+=|?:<>/-]', Operator),
                # After the name of the variable has been tokenized enter
                # a new mode for the value.
                (r'[a-zA-Z_][a-zA-Z0-9_]*', NameDecl, 'varvalue'),
            ],
            'varvalue' : [
                (r'\s+', Text),
                (r',', Punctuation, '#pop'),
                (r';', Punctuation, '#pop:2'),
                # Comments
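                (r'/(\\\n)?[*](.|\n)*?[*](\\\n)?/', Comment.Multiline),
                # Keep '-' last in the class so it is a literal, not a range
                # (otherwise '/-\[' also matches digits and capital letters).
                (r'[~!%^&*+=|?:<>/\[\]-]', Operator),
                (r'\d+[LlUu]*', Number.Integer),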
                # Rules for strings and chars.
                (r'L?"', String, 'string'),
                (r"L?'(\\.|\\[0-7]{1,3}|\\x[a-fA-F0-9]{1,2}|[^\\\'\n])'", String.Char),
                (r'[a-zA-Z_][a-zA-Z0-9_]*', Name),
                # Getting arrays right is tricky.
                (r'{', Punctuation, 'arrvalue'),
            ],
            'arrvalue' : [
                (r'\s+', Text),
                (r'\d+[LlUu]*', Number.Integer),
                (r'}', Punctuation, '#pop'),
                (r'[~!%^&*+=|?:<>/\[\]-]', Operator),
                (r',', Punctuation),
                (r'[a-zA-Z_][a-zA-Z0-9_]*', Name),
                (r'{', Punctuation, '#push'),
            ]
        }
    
    code = '''
    #include <stdio.h>
    
    void main(int argc, char *argv[]) 
    {
        int vec_a, vec_b;
        int a = 3, /* Mo;yo */ b=5, c=7;
        int m = 3 * a + b, x = /* comments everywhere */ a * a;
        char *myst = "hi;there";
        char semi = ';';
        time_t now = /* Null; */ NULL;
        int arr[10] = {1, 2, 9 / c};
        int foo[][2] = {{1, 2}};
    
        a = b * 9;
        c = 77;
        d = (int) 99;
    }
    '''
    for formatter in [TerminalFormatter, HtmlFormatter]:
        print(highlight(code, CDeclLexer(), formatter()))