Tags: python, parsing, text-parsing, shlex

shlex.split(): How to keep quotes around sub-strings and not split by in-sub-string whitespace?


I want the input string "add [7,8,9+5,'io open'] 7&4 67" to be split like ['add', "[7,8,9+5,'io open']", '7&4', '67'], i.e., within the line, quoted strings must stay within their quotes and mustn't be split at all, while everything else is split on whitespace, like so:

>>> import shlex
>>> shlex.split("add [7,8,9+5,\\'io\\ open\\'] 7&4 67")
['add', "[7,8,9+5,'io open']", '7&4', '67']

But the user shouldn't have to type the backslashes if possible, at least not for the quotes, and ideally not for in-string whitespace either.
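
For reference, the doubled backslashes above are just Python string-literal escaping; typed as a raw string (or read straight from input()), a single backslash per escape already works, though that's still more than I'd like to require of the user:

>>> shlex.split(r"add [7,8,9+5,\'io\ open\'] 7&4 67")
['add', "[7,8,9+5,'io open']", '7&4', '67']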

What would a function break_down() that does the above look like? I attempted the below, but it doesn't deal with in-string whitespace:

>>> import shlex
>>> def break_down(ln):
...     ln = ln.replace("'","\\'")
...     ln = ln.replace('"','\\"')
...     # User will still have to escape in-string whitespace
...     return shlex.split(ln) # Note: can't use posix=False; it splits on in-string whitespace and has no escape sequences
...
>>> break_down("add [7,8,9+5,'io\\ open'] 7&4 67")
['add', "[7,8,9+5,'io open']", '7&4', '67']
>>> break_down("add [7,8,9+5,'io open'] 7&4 67")
['add', "[7,8,9+5,'io", "open']", '7&4', '67']

Maybe there's a better function/method/technique to do this; I'm not very experienced with the entire standard library yet. Or maybe I'll just have to write a custom split()?

EDIT 1: Progress

>>> def break_down(ln):
...     ln = r"{}".format(ln) # note: effectively a no-op; the raw prefix only affects the "{}" literal, not ln
...     ln = ln.replace("'",r"\'")
...     ln = ln.replace('"',r'\"')
...     return shlex.split(ln)

So now the user only has to use a single \ to escape quotes/spaces etc., kind of like they would in a shell. Seems workable.
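
For example, with the input written as a raw string to stand in for what a user would actually type:

>>> break_down(r"add [7,8,9+5,'io\ open'] 7&4 67")
['add', "[7,8,9+5,'io open']", '7&4', '67']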


Solution

  • I solved this like I should have, by writing a custom lexing system (sort of).

    I decided to use re, because the code uses re a lot all over anyway, and with help from this reddit comment, I have settled on this:

    import re

    def lex(ln):
        ln = ln.split('#')[0]  # Strip comments

        tkn_delims, relst = '\'\'""{}()[]', []  # Edit tkn_delims to add more delimiter pairs
        for i in range(0, len(tkn_delims), 2):
            # Regex for one delimiter pair: opener, anything that isn't the closer, closer
            relst.append(r'\{0}[^{1}]*\{1}'.format(tkn_delims[i], tkn_delims[i + 1]))
        regex = '|'.join(relst) + r'|\S+'  # Fall back to whitespace-delimited tokens

        return re.findall(regex, ln)
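
    For example, this version handles the original input and strips anything after a # :

    >>> lex("add [7,8,9+5,'io open'] 7&4 67  # trailing comment")
    ['add', "[7,8,9+5,'io open']", '7&4', '67']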
    

    Edit: Thanks to @furas 's comment: "first reaction: you can't use # in arguments...", the code has been edited to only recognise the start of a comment when # appears as the 1st character of a token. Thus:

    Edited lex():

    import re

    def lex(ln):
        '''Lexing:
        1. Generate a regex for each token type:
           a) tokens that are python sequence literals.
           b) tokens that are whitespace delimited.
           There is only one 'layer' of lexing, i.e. for sequences within sequences,
           the entire outermost sequence is one token.
        2. Remove tokens that fall inside comments.
        3. Return the list of tokens.
        '''

        token_delims = '\'\'""{}()[]'
        regex_subexpressions = []
        for i in range(0, len(token_delims), 2):
            # Regex for each sequence delimiter pair: opener, anything but the closer, closer
            regex_subexpressions.append(r'\{0}[^{1}]*\{1}'.format(token_delims[i], token_delims[i + 1]))
        regex = '|'.join(regex_subexpressions) + r'|\S+'  # Combine with whitespace delimitation for the remainder

        tokens = re.findall(regex, ln)

        # Drop the comment: everything from the first token that starts with '#' onwards.
        # (Removing items from a list while iterating over it skips elements, so slice instead.)
        for i, token in enumerate(tokens):
            if token.startswith('#'):
                tokens = tokens[:i]
                break

        return tokens
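
    For example, the edited version only treats # as a comment when it starts a token, so a # inside a sequence literal survives:

    >>> lex("add ['#not', 'a comment'] 7&4  # real comment")
    ['add', "['#not', 'a comment']", '7&4']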