I want the input string "add [7,8,9+5,'io open'] 7&4 67"
to be split like ['add', "[7,8,9+5,'io open']", '7&4', '67']
, i.e., within the line, quoted strings must stay intact and mustn't be split at all; everywhere else, whitespace-based splitting is required, like so:
>>> import shlex
>>> shlex.split("add [7,8,9+5,\\'io\\ open\\'] 7&4 67")
['add', "[7,8,9+5,'io open']", '7&4', '67']
But the user shouldn't have to type the \\ escapes if possible: at least not for quotes, and ideally not for in-string whitespace either.
What would a function break_down() that does the above look like? I attempted the below, but it doesn't deal with in-string whitespace:
>>> import shlex
>>> def break_down(ln):
...     ln = ln.replace("'", "\\'")
...     ln = ln.replace('"', '\\"')
...     # User will still have to escape in-string whitespace
...     # Note: can't use posix=False; it splits on in-string whitespace and has no escape sequences
...     return shlex.split(ln)
...
>>> break_down("add [7,8,9+5,'io\\ open'] 7&4 67")
['add', "[7,8,9+5,'io open']", '7&4', '67']
>>> break_down("add [7,8,9+5,'io open'] 7&4 67")
['add', "[7,8,9+5,'io", "open']", '7&4', '67']
Maybe there's a better function/method/technique to do this; I'm not very experienced with the entire standard library yet. Or maybe I'll just have to write a custom split()?
EDIT 1 : Progress
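>>> def break_down(ln):
...     # Note: format() returns ln unchanged; the r prefix only affects the literal "{}".
...     # A single backslash already suffices when ln comes from user input or a raw string.
...     ln = r"{}".format(ln)
...     ln = ln.replace("'", r"\'")
...     ln = ln.replace('"', r'\"')
...     return shlex.split(ln)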
So now the user only has to use a single \ to escape any quotes, spaces, etc., kind of like they would in a shell. Seems workable.
I solved this like I should have, by writing a custom lexing system (sort of).
I decided to use re, because the code uses re a lot all over anyway, and with help from this reddit comment, have settled on this:
import re

def lex(ln):
    ln = ln.split('#')[0]  # Strip comments
    tkn_delims, relst = '\'\'""{}()[]', []  # Edit tkn_delims to add more delimiters
    for i in range(0, len(tkn_delims), 2):
        # Add regex for delimiter pair: opener, anything but the closer, closer
        relst.append(r'\{0}[^{1}]*\{1}'.format(tkn_delims[i], tkn_delims[i+1]))
    regex = '|'.join(relst) + r'|\S+'  # Build regex; \S+ catches everything else
    return re.findall(regex, ln)
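To check the pattern against the input from the question, here is a minimal sketch of the same idea. The names DELIM_PAIRS, build_token_regex and lex_tokens are made up for this illustration, and using re.escape plus a precompiled pattern is my own variation, not part of the code above; comment stripping is left out.

```python
import re

# Delimiter pairs that should be kept together as one token
# (hypothetical constant; same pairs as tkn_delims in lex()).
DELIM_PAIRS = [("'", "'"), ('"', '"'), ("{", "}"), ("(", ")"), ("[", "]")]

def build_token_regex(pairs=DELIM_PAIRS):
    """Build one alternation: a delimited run per pair, then \\S+ as fallback."""
    parts = []
    for opener, closer in pairs:
        o, c = re.escape(opener), re.escape(closer)
        # opener, any characters except the closer, closer
        parts.append("{0}[^{1}]*{1}".format(o, c))
    parts.append(r"\S+")  # everything else is whitespace-delimited
    return re.compile("|".join(parts))

TOKEN_RE = build_token_regex()

def lex_tokens(line):
    # The pattern has no capture groups, so findall returns whole matches.
    return TOKEN_RE.findall(line)

print(lex_tokens("add [7,8,9+5,'io open'] 7&4 67"))
# ['add', "[7,8,9+5,'io open']", '7&4', '67']
```

Note that the alternatives are only tried at the start of a token, so a quote in the middle of a word (for example foo'a b') still gets split at the space by the \S+ fallback.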
Edit: Thanks to @furas's comment ("first reaction: you can't use # in arguments..."), the code now recognises the start of a comment only if # is the first character of a token. Thus:
<command> '#...' ['#...#']
lexes to ['<command>',"'#...'","['#...#']"], while
<command> '...' # does xyz
or
<command> '...' #does xyz
lexes to ['<command>',"'...'"].
Edited lex():
import re

def lex(ln):
    ''' Lexing:
    1. Generate a regex for each token type:
        a) tokens that are Python sequence literals.
        b) tokens that are whitespace-delimited.
       There is only one 'layer' of lexing, i.e. in case of sequences within sequences,
       the entire outermost sequence is one token. The patterns are not recursive, so an
       inner sequence of the same delimiter type ends the token early.
    2. Remove tokens that fall into comments
    3. Return list of tokens
    '''
    token_delims = '\'\'""{}()[]'
    regex_subexpressions = []
    for i in range(0, len(token_delims), 2):
        # Regex for each sequence delimiter pair
        regex_subexpressions.append(r'\{0}[^{1}]*\{1}'.format(token_delims[i], token_delims[i+1]))
    # Combine with regex for whitespace delimitation on the remainder
    regex = '|'.join(regex_subexpressions) + r'|\S+'
    tokens = re.findall(regex, ln)
    # Drop the first token that starts with '#' and everything after it.
    # (Removing items from the list while iterating over it skips elements.)
    for i, token in enumerate(tokens):
        if token[0] == '#':
            return tokens[:i]
    return tokens
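The regex patterns in lex() are not recursive, so a nested sequence of the same type (e.g. [[1, 2], 3]) or a closing bracket inside a quoted string gets cut short. If that matters, one alternative is a plain character scan that tracks quote state and bracket depth. The following is only a rough sketch under that assumption; split_top_level is a hypothetical name, and it does not handle backslash escapes.

```python
PAIRS = {"[": "]", "(": ")", "{": "}"}  # opening bracket -> closing bracket
QUOTES = "'\""

def split_top_level(line):
    """Split on whitespace that is outside quotes and outside brackets,
    then drop a trailing comment that starts with '#' at token start."""
    tokens, current = [], []
    stack = []    # closing brackets we are still waiting for
    quote = None  # the active quote character, or None
    for ch in line:
        if quote:
            current.append(ch)
            if ch == quote:          # closing quote
                quote = None
        elif ch in QUOTES:
            quote = ch               # opening quote
            current.append(ch)
        elif ch in PAIRS:
            stack.append(PAIRS[ch])  # one level deeper
            current.append(ch)
        elif stack and ch == stack[-1]:
            stack.pop()              # one level up
            current.append(ch)
        elif ch.isspace() and not stack:
            if current:              # top-level whitespace ends a token
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    # A token whose first character is '#' starts a comment.
    for i, tok in enumerate(tokens):
        if tok.startswith("#"):
            return tokens[:i]
    return tokens

print(split_top_level("add [[1, 2], 'a b'] 7&4 67"))
# ['add', "[[1, 2], 'a b']", '7&4', '67']
print(split_top_level("cmd '#x' # note here"))
# ['cmd', "'#x'"]
```

It is longer than the regex version, but the nesting rules are explicit and easy to extend.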