I am trying to adapt bug detection software built for JavaScript files so that I can use it to find bugs in Python files.
The process involves finding the start and end position of each token based on its line and column numbers.
Below is the output of running the acorn JS parser on a .js file:
In the image above, the start and end locations of a token are offsets counted over the entire document, not per line.
I have checked Python's tokenize module, but it only gives values equivalent to the loc.start and loc.end pairs in the picture above (line and column numbers).
How can I get start and end values for Python tokens like those in the acorn output?
In principle, all you need in order to convert line-number/column pairs into byte offsets into the document is a list of the starting byte offset of each line. So one simple way to do this would be to accumulate that information as the file is read. That's reasonably simple, since you can give tokenize your own function which returns input lines. So you can collect a mapping from line number to file position, and then wrap tokenize in a function which uses that mapping to add start and end indices.
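For instance, if the whole document is already available as a bytes string, that list of line-start offsets can be built up front and each (line, column) pair converted with a single addition. A minimal sketch of that idea (the helper names are mine, for illustration only):

import io
import tokenize

def line_start_offsets(source_bytes):
    '''Byte offset at which each (1-based) line of source_bytes starts.'''
    offsets = [0, 0]   # dummy entry at index 0; tokenize numbers lines from 1
    for line in io.BytesIO(source_bytes).readlines():
        offsets.append(offsets[-1] + len(line))
    return offsets

def token_spans(source_bytes):
    '''Yield (token, start_offset, end_offset) over the whole document.'''
    offsets = line_start_offsets(source_bytes)
    for t in tokenize.tokenize(io.BytesIO(source_bytes).readline):
        yield (t,
               offsets[t.start[0]] + t.start[1],
               offsets[t.end[0]] + t.end[1])

Here source_bytes could simply be open(filename, 'rb').read(). The example below computes the same numbers incrementally, while the file is being read.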
In the following example, I use file.tell to extract the current file position. But that won't work if the input is not a seekable file; in that case, you would need to come up with some alternative, such as keeping track of the number of bytes returned [Note 1]; a sketch of that variant appears after the note below. Depending on what you need the indices for, that might or might not be important: if you only need unique numbers, for example, it would be sufficient to keep a running total of the string lengths of each line.
import tokenize
from collections import namedtuple

MyToken = namedtuple('MyToken', 'type string startpos endpos start end')

def my_tokenize(infile):
    '''Generator which requires one argument, typically an io.IOBase
    object with `tell` and `readline` member functions.
    '''
    # Used to track the starting position of each line.
    # Note that tokenize starts line numbers at 1 and column numbers at 0.
    offsets = [0]

    # Function used to wrap calls to infile.readline(); stores the current
    # stream position at the beginning of each line.
    def wrapped_readline():
        offsets.append(infile.tell())
        return infile.readline()

    # For each returned token, substitute type with exact_type and
    # add token boundaries as stream positions.
    for t in tokenize.tokenize(wrapped_readline):
        startline, startcol = t.start
        endline, endcol = t.end
        yield MyToken(t.exact_type, t.string,
                      offsets[startline] + startcol,
                      offsets[endline] + endcol,
                      t.start, t.end)

# Adapted from tokenize.main(). Errors are mine.
def main():
    import sys
    from token import tok_name

    def print_tokens(gen):
        for t in gen:
            rangepos = f'{t.startpos}-{t.endpos}'
            linecol = f'{t.start[0]},{t.start[1]}-{t.end[0]},{t.end[1]}'
            print(f'{rangepos:<10} {linecol:<20} '
                  f'{tok_name[t.type]:<15}{t.string!r}')

    if len(sys.argv) <= 1:
        print_tokens(my_tokenize(sys.stdin.buffer))
    else:
        for filename in sys.argv[1:]:
            with open(filename, 'rb') as infile:
                print_tokens(my_tokenize(infile))

if __name__ == '__main__':
    main()
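Run as a script with one or more filenames, this prints each token's computed document offsets next to its line,column range, token name and token text; with no arguments it tokenizes whatever arrives on standard input (which, for the tell() calls to work, must generally be redirected from a real file rather than a pipe).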
Note 1: If the input is opened in text mode, what readline returns is a string, not a bytes object, so its length is measured in characters rather than bytes; furthermore, on platforms (such as Windows) in which the end-of-line is not a single character, the substitution of the end-of-line with \n means that the number of characters read does not correspond to the byte position in the file.
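For completeness, here is a minimal sketch of the alternative mentioned above for non-seekable inputs: instead of calling tell(), the wrapper keeps a running total of the bytes returned by readline. The function name my_tokenize_nonseekable is mine, not part of any library, and the approach is only exact when readline returns bytes, as Note 1 explains.

import tokenize
from collections import namedtuple

# Same record layout as MyToken above.
MyToken = namedtuple('MyToken', 'type string startpos endpos start end')

def my_tokenize_nonseekable(readline):
    '''Like my_tokenize, but for non-seekable binary streams: accumulate
    the number of bytes returned by readline instead of calling tell().
    '''
    offsets = [0]    # offsets[n] is the byte offset at which line n starts
    consumed = 0     # running total of bytes handed to tokenize so far

    def wrapped_readline():
        nonlocal consumed
        # The next line starts wherever the previous one ended.
        offsets.append(consumed)
        line = readline()
        consumed += len(line)
        return line

    for t in tokenize.tokenize(wrapped_readline):
        startline, startcol = t.start
        endline, endcol = t.end
        yield MyToken(t.exact_type, t.string,
                      offsets[startline] + startcol,
                      offsets[endline] + endcol,
                      t.start, t.end)

This can be fed, for example, sys.stdin.buffer.readline, which the tell()-based version typically cannot handle when stdin is a pipe.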