I want to ignore every comment like { comments } and // comments.
I have a pointer named peek that walks my string character by character. I know how to skip newlines, tabs, and spaces, but I don't know how to skip comments.
string = """ beGIn west WEST north//comment1 \n
north north west East east south\n
// comment west\n
{\n
comment\n
}\n end
"""
tokens = []
tmp = ''
for i, peek in enumerate(string.lower()):
    if peek == ' ' or peek == '\n':
        tokens.append(tmp)
        # ignoring whitespace and comments
        if len(tmp) > 0:
            print(tmp)
        tmp = ''
    else:
        tmp += peek
Here is my result:
begin
west
west
north//
comment1
north
north
west
east
east
south
{
comment2
}
end
As you can see, spaces are ignored but comments aren't.
How can I get a result like the one below?
begin
west
west
north
north
north
west
east
east
south
end
Simply use a global variable skip = False, set it to True when you get {, set it back to False when you get }, and run the rest of your if/else only inside if not skip:
string = """ beGIn west WEST north//comment1 \n
north north west East east south\n
// comment west\n
{\n
comment\n
}\n end
"""
tokens = []
tmp = ''
skip = False
for i, peek in enumerate(string.lower()):
    if peek == '{':
        skip = True
    elif peek == '}':
        skip = False
    elif not skip:
        if peek == ' ' or peek == '\n':
            tokens.append(tmp)
            # ignoring whitespace and comments
            if len(tmp) > 0:
                print(tmp)
            tmp = ''
        else:
            tmp += peek
Because you may have nested braces like

{
    { comment1 }
    comment2
    { comment3 }
}

it is better to use skip as a counter of { and }:
string = """ beGIn west WEST north//comment1 \n
north north west East east south\n
// comment west\n
{\n
{ comment1 }\n
comment2\n
{ comment3 }\n
}\n end
"""
tokens = []
tmp = ''
skip = 0
for i, peek in enumerate(string.lower()):
    if peek == '{':
        skip += 1
    elif peek == '}':
        skip -= 1
    elif not skip:  # elif skip == 0:
        if peek == ' ' or peek == '\n':
            tokens.append(tmp)
            # ignoring whitespace and comments
            if len(tmp) > 0:
                print(tmp)
            tmp = ''
        else:
            tmp += peek
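Note that the versions above only strip { } comments, so a // comment glued to a word (like north//comment1) still leaks into the output. A possible sketch that also skips // comments up to the end of the line (assuming // cannot appear inside a word you want to keep) could look like this:

```python
string = """ beGIn west WEST north//comment1
north north west East east south
// comment west
{
{ comment1 }
comment2
{ comment3 }
}
end
"""

tokens = []
tmp = ''
skip = 0              # nesting depth of { } comments
line_comment = False  # currently inside a // comment

for peek in string.lower():
    if line_comment:
        if peek != '\n':
            continue
        line_comment = False  # newline ends the comment; fall through to flush tmp
    if peek == '{':
        skip += 1
    elif peek == '}':
        skip -= 1
    elif skip == 0:
        if peek == '/' and tmp.endswith('/'):
            # second '/' of "//": drop the buffered '/' and skip the rest of the line
            tmp = tmp[:-1]
            line_comment = True
        elif peek == ' ' or peek == '\n':
            if len(tmp) > 0:
                tokens.append(tmp)
            tmp = ''
        else:
            tmp += peek

print(tokens)
```

With the test string above this prints only the direction words, with both kinds of comments removed.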
But maybe it would be better to collect everything as tokens and filter the tokens later. I skip that idea here.
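For completeness, that tokenize-first, filter-later idea could be sketched with a single regex that captures comments and words as separate token kinds (the group names here are hypothetical; the non-greedy { } pattern does not handle nested braces):

```python
import re

# Match each comment or word as its own token, tagging it by group name.
TOKEN_RE = re.compile(
    r'(?P<MULTI>\{.*?\})'    # { ... } comment (non-greedy, no nesting)
    r'|(?P<ONE>//[^\n]*)'    # // comment until end of line
    r'|(?P<NAME>\w+)',       # an ordinary word token
    re.DOTALL,
)

text = """ beGIn west WEST north//comment1
north north west East east south
// comment west
{
comment
}
end
"""

all_tokens = [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text.lower())]
# Keep only the NAME tokens - the comment tokens are filtered out afterwards.
names = [value for kind, value in all_tokens if kind == 'NAME']
print(names)
```

The filtering step is a plain list comprehension, so other token kinds could be kept or inspected later instead of being discarded.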
EDIT:
Here is a version using the Python module sly, which works similarly to the C/C++ tools lex/yacc. The regex for MULTI_LINE_COMMENT I found in another parser-building tool, lark:
from sly import Lexer, Parser

class MyLexer(Lexer):
    # Create it before defining the regexes for the tokens
    tokens = { NAME, ONE_LINE_COMMENT, MULTI_LINE_COMMENT }

    ignore = ' \t'

    # Tokens
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    ONE_LINE_COMMENT = r'//.*'
    MULTI_LINE_COMMENT = r'{(.|\n)*}'

    # Ignored pattern
    ignore_newline = r'\n+'

    # Extra action for newlines
    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')

    # Work with errors
    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

if __name__ == '__main__':
    text = """ beGIn west WEST north//comment1
north north west East east south
// comment west
{
{ comment1 }
comment2
{ comment3 }
}
end
"""
    lexer = MyLexer()
    tokens = lexer.tokenize(text)
    for item in tokens:
        print(item.type, ':', item.value)
Result:
NAME : beGIn
NAME : west
NAME : WEST
NAME : north
ONE_LINE_COMMENT : //comment1
NAME : north
NAME : north
NAME : west
NAME : East
NAME : east
NAME : south
ONE_LINE_COMMENT : // comment west
MULTI_LINE_COMMENT : {
{ comment1 }
comment2
{ comment3 }
}
NAME : end