Tags: python, parsing, tokenize, abstract-syntax-tree, acorn

Parser to get start and end positions of a token


I am trying to replicate bug-detection software built for JavaScript files, so that I can use it for finding bugs in Python files.

The process involves finding the start and end positions of a token as offsets into the entire document, based on its line and column numbers.

Below is the output of using the acorn JS parser on a .js file:

[Image: acorn parse output]

In the above image, the start and end locations of a token are character offsets into the entire document.

I have checked Python's tokenize module, but it only gives values equivalent to the loc.start and loc.end ones in the picture above (line and column numbers), not the absolute start and end offsets.

[Image: python tokenizer output]
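
For illustration (the screenshot is not reproduced here), here is roughly what tokenize reports for a one-line source. It gives line/column pairs only, with no offsets into the whole document:

    import io
    import tokenize

    src = b'x = 1\n'
    for tok in tokenize.tokenize(io.BytesIO(src).readline):
        print(tok.start, tok.end, tokenize.tok_name[tok.type], repr(tok.string))
    # (0, 0) (0, 0) ENCODING 'utf-8'
    # (1, 0) (1, 1) NAME 'x'
    # (1, 2) (1, 3) OP '='
    # (1, 4) (1, 5) NUMBER '1'
    # (1, 5) (1, 6) NEWLINE '\n'
    # (2, 0) (2, 0) ENDMARKER ''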

But how can I get the start and end values for Python's tokens, just like in the acorn output picture?


Solution

  • In principle, all you need in order to convert line-number/column pairs into byte offsets into the document is a list of the starting byte offset of each line (as sketched just below). So one simple way to do this is to accumulate that information as the file is read. That's reasonably simple, since you can give tokenize your own function which returns input lines: you can collect a mapping from line number to file position, and then wrap tokenize in a function which uses that mapping to add start and end indices.
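
    For example, the whole table can be built in one pass over the document. A minimal sketch (the sample source is illustrative), where offsets[n] is the byte offset at which line n starts, and index 0 is unused because tokenize numbers lines from 1:

    data = b'first = 1\nsecond = 2\n'
    offsets = [0, 0]    # index 0 is a placeholder; line 1 starts at offset 0
    for line in data.splitlines(keepends=True):
        offsets.append(offsets[-1] + len(line))
    # offsets[2] == 10, so a token at line 2, column 3 begins at byte 13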

    In the following example, I use infile.tell to extract the current file position. But that won't work if the input is not a seekable file; in that case, you would need to come up with some alternative, such as keeping track of the number of bytes returned [Note 1] (see the sketch after the example). Depending on what you need the indices for, that might or might not be important: if you only need unique numbers, for example, it would be sufficient to keep a running total of the string lengths of each line.

    import tokenize
    from collections import namedtuple
    MyToken = namedtuple('MyToken', 'type string startpos endpos start end')
    
    def my_tokenize(infile):
        '''Generator which requires one argument, typically an io.IOBase
           object with `tell` and `readline` member functions.
        '''
        # Used to track starting position of each line.
        # Note that tokenize starts line numbers at 1 and column numbers at 0
        offsets = [0]
        # Function used to wrap calls to infile.readline(); stores current
        # stream position at the beginning of each line.
        def wrapped_readline():
            offsets.append(infile.tell())
            return infile.readline()
    
        # For each returned token, substitute type with exact_type and
        # add token boundaries as stream positions
        for t in tokenize.tokenize(wrapped_readline):
            startline, startcol = t.start
            endline, endcol = t.end
            yield MyToken(t.exact_type, t.string,
                          offsets[startline] + startcol,
                          offsets[endline] + endcol,
                          t.start, t.end)
    
    # Adapted from the tokenize module's main(). Errors are mine.
    def main():
        import sys
        from token import tok_name
    
        def print_tokens(gen):
            for t in gen:
                rangepos = f'{t.startpos}-{t.endpos}'
                linecol = f'{t.start[0]},{t.start[1]}-{t.end[0]},{t.end[1]}'
                print(f'{rangepos:<10} {linecol:<20} '
                      f'{tok_name[t.type]:<15}{t.string!r}')
    
        if len(sys.argv) <= 1:
            print_tokens(my_tokenize(sys.stdin.buffer))
        else:
            for filename in sys.argv[1:]:
                with open(filename, 'rb') as infile:
                    print_tokens(my_tokenize(infile))
    
    if __name__ == '__main__':
        main()
    
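    If the input is not seekable (a pipe, for instance), the alternative mentioned above is to keep a running total of the bytes returned by each readline call. Here is a minimal sketch of that variant (the name my_tokenize_unseekable is mine; it reuses tokenize and MyToken from above, and Note 1's caveats apply):

    def my_tokenize_unseekable(readline):
        '''Like my_tokenize, but keeps a running total of the bytes
           returned by readline() instead of calling tell(), so it also
           works on non-seekable streams. readline must return bytes.
        '''
        offsets = [0]
        total = 0
        def wrapped_readline():
            nonlocal total
            # Record where this line starts before consuming it.
            offsets.append(total)
            line = readline()
            total += len(line)
            return line

        for t in tokenize.tokenize(wrapped_readline):
            startline, startcol = t.start
            endline, endcol = t.end
            yield MyToken(t.exact_type, t.string,
                          offsets[startline] + startcol,
                          offsets[endline] + endcol,
                          t.start, t.end)

    For example, my_tokenize_unseekable(sys.stdin.buffer.readline) works on a pipe, where tell() would raise OSError.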

    Notes

    1. But that's not as easy as it sounds. Unless you open the file in binary mode, what readline returns is a string, not a bytes object, so its length is measured in characters rather than bytes. Furthermore, on platforms (such as Windows) where the end-of-line is not a single character, the substitution of the end-of-line with \n means that the number of characters read does not correspond to the number of characters in the file.
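
       A quick illustration of the discrepancy, assuming CRLF line endings in the raw bytes:

       import io

       raw = b'x = 1\r\n'
       text_line = io.TextIOWrapper(io.BytesIO(raw)).readline()
       byte_line = io.BytesIO(raw).readline()
       print(len(text_line))   # 6: universal newlines collapse '\r\n' to '\n'
       print(len(byte_line))   # 7: the raw bytes, b'\r\n' included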