
Python, Windows: parsing command lines with shlex


When you have to split a command line, for example to call Popen, the best practice seems to be

subprocess.Popen(shlex.split(cmd), ...)

but the documentation says:

The shlex class makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell ...

So, what's the correct way on win32? And what about quote parsing and POSIX vs non-POSIX mode?


Solution

  • There is no valid command-line splitting function in the Python stdlib for Windows or multi-platform use so far (as of Mar 2016).

    subprocess

    So in short, for subprocess.Popen, subprocess.call etc., it is best to do this:

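    import shlex, subprocess, sys

    if sys.platform == 'win32':
        args = cmd               # Windows: pass the command string unchanged
    else:
        args = shlex.split(cmd)  # POSIX: split into an argument list
    subprocess.Popen(args, ...)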
    

    On Windows the split is not necessary for either value of the shell option: internally, Popen just uses subprocess.list2cmdline to re-join the split arguments into a single string :-) .
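To illustrate the re-joining step, here is a minimal sketch that calls subprocess.list2cmdline directly. The argument values are made up for the example, and list2cmdline is an undocumented helper, so its exact behavior should be treated as an implementation detail rather than a stable API.

```python
import subprocess

# Hypothetical argument list; only the second item contains a space.
args = ['prog.exe', 'arg one', 'plain']

# Join the list into one command-line string, quoting where needed.
joined = subprocess.list2cmdline(args)
print(joined)  # prog.exe "arg one" plain
```

So splitting a string on Windows only to hand the list to Popen means the string gets rebuilt anyway, possibly with different quoting than the original.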

    With shell=True, shlex.split is not necessary on Unix either: pass the command string as is.

    Split or not, on Windows you need to include the file extension explicitly when starting .bat or .cmd scripts (unlike .exe or .com files), unless shell=True.

    Notes on command-line splitting nonetheless:

    shlex.split(cmd, posix=0) retains backslashes in Windows paths, but it doesn't handle quoting and escaping correctly. It's not very clear what the posix=0 mode of shlex is good for at all, but it certainly tempts Windows and cross-platform programmers ...
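A small sketch of the difference between the two shlex modes, using a made-up Windows-style command line (the path and argument are assumptions for illustration only):

```python
import shlex

# Hypothetical command line with a Windows path and a quoted argument.
cmd = r'C:\dir\prog.exe "arg one"'

# Non-POSIX mode: backslashes survive, but the quotes stay in the argument.
non_posix = shlex.split(cmd, posix=False)
print(non_posix)  # ['C:\\dir\\prog.exe', '"arg one"']

# POSIX mode: quotes are removed, but backslashes are consumed as escapes.
posix = shlex.split(cmd, posix=True)
print(posix)      # ['C:dirprog.exe', 'arg one']
```

Neither result is what a Windows program would receive in argv: one keeps the quote characters, the other destroys the path.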

    The Windows API exposes CommandLineToArgvW, reachable from Python as ctypes.windll.shell32.CommandLineToArgvW. Its documentation says:

    Parses a Unicode command line string and returns an array of pointers to the command line arguments, along with a count of such arguments, in a way that is similar to the standard C run-time argv and argc values.

    def win_CommandLineToArgvW(cmd):
        import ctypes
        nargs = ctypes.c_int()
        ctypes.windll.shell32.CommandLineToArgvW.restype = ctypes.POINTER(ctypes.c_wchar_p)
        # Python 2: unicode(cmd) makes sure a wide string is passed (on Python 3, str is already Unicode)
        lpargs = ctypes.windll.shell32.CommandLineToArgvW(unicode(cmd), ctypes.byref(nargs))
        args = [lpargs[i] for i in range(nargs.value)]
        if ctypes.windll.kernel32.LocalFree(lpargs):
            raise AssertionError
        return args
    

    However, CommandLineToArgvW is unreliable: its results are only loosely similar to the standard C argv and argc parsing, as the following comparison with sys.argv shows:

    >>> win_CommandLineToArgvW('aaa"bbb""" ccc')
    [u'aaa"bbb"""', u'ccc']
    >>> win_CommandLineToArgvW('""  aaa"bbb""" ccc')
    [u'', u'aaabbb" ccc']
    >>> 
    
    C:\scratch>python -c "import sys;print(sys.argv)" aaa"bbb""" ccc
    ['-c', 'aaabbb"', 'ccc']
    
    C:\scratch>python -c "import sys;print(sys.argv)" ""  aaa"bbb""" ccc
    ['-c', '', 'aaabbb"', 'ccc']
    

    Watch http://bugs.python.org/issue1724822 for possible future additions to the Python library. (The function mentioned there on the "fisheye3" server doesn't really work correctly.)
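For comparison, here is a rough sketch of the backslash and quote rules that the C runtime appears to apply to arguments after the program name. The function name split_msvc_style is hypothetical, and the rules encoded here are an assumption based on Microsoft's published description of C command-line parsing; older runtimes and the parsing of argv[0] may differ, and cmd.exe metacharacters are not handled at all.

```python
def split_msvc_style(cmdline):
    """Hypothetical helper: split a string using C-runtime-like rules (assumed)."""
    args, cur = [], []
    in_quotes = False   # currently inside a "..." region
    started = False     # an argument (possibly empty) has been started
    i, n = 0, len(cmdline)
    while i < n:
        c = cmdline[i]
        if c == '\\':
            # Count the run of backslashes and look at what follows it.
            j = i
            while j < n and cmdline[j] == '\\':
                j += 1
            count = j - i
            if j < n and cmdline[j] == '"':
                # 2k backslashes + quote: k backslashes, quote is a delimiter.
                # 2k+1 backslashes + quote: k backslashes and a literal quote.
                cur.append('\\' * (count // 2))
                if count % 2:
                    cur.append('"')
                    j += 1
            else:
                # Backslashes not followed by a quote are literal.
                cur.append('\\' * count)
            started = True
            i = j
        elif c == '"':
            if in_quotes and i + 1 < n and cmdline[i + 1] == '"':
                cur.append('"')          # "" inside quotes: literal quote
                i += 2
            else:
                in_quotes = not in_quotes
                i += 1
            started = True
        elif c in ' \t' and not in_quotes:
            if started:
                args.append(''.join(cur))
                cur, started = [], False
            i += 1
        else:
            cur.append(c)
            started = True
            i += 1
    if started:
        args.append(''.join(cur))
    return args


print(split_msvc_style('aaa"bbb""" ccc'))      # ['aaabbb"', 'ccc']
print(split_msvc_style('""  aaa"bbb""" ccc'))  # ['', 'aaabbb"', 'ccc']
```

For these two inputs the sketch reproduces the sys.argv results shown above, which is where CommandLineToArgvW deviates. It is meant only to make the rules concrete, not as a tested replacement.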


    Cross-platform candidate function

    Valid Windows command-line splitting is rather crazy. E.g. try \ \\ \" \\"" \\\"aaa """" ...

    My current candidate for cross-platform command-line splitting is the following function, which I am considering proposing for the Python library. It's multi-platform; it's ~10x faster than shlex, which steps through the input one character at a time as a stream; and it also respects pipe-related characters (unlike shlex). It already passes a list of tough real-shell tests on Windows and on Linux bash, plus the legacy POSIX test patterns of test_shlex. I'm interested in feedback about remaining bugs.

    import re
    import sys

    def cmdline_split(s, platform='this'):
        """Multi-platform variant of shlex.split() for command-line splitting.
        For use with subprocess, for argv injection etc. Using fast REGEX.
    
        platform: 'this' = auto from current platform;
                  1 = POSIX; 
                  0 = Windows/CMD
                  (other values reserved)
        """
        if platform == 'this':
            platform = (sys.platform != 'win32')
        if platform == 1:
            RE_CMD_LEX = r'''"((?:\\["\\]|[^"])*)"|'([^']*)'|(\\.)|(&&?|\|\|?|\d?\>|[<])|([^\s'"\\&|<>]+)|(\s+)|(.)'''
        elif platform == 0:
            RE_CMD_LEX = r'''"((?:""|\\["\\]|[^"])*)"?()|(\\\\(?=\\*")|\\")|(&&?|\|\|?|\d?>|[<])|([^\s"&|<>]+)|(\s+)|(.)'''
        else:
            raise AssertionError('unknown platform %r' % platform)
    
        args = []
        accu = None   # collects pieces of one arg
        for qs, qss, esc, pipe, word, white, fail in re.findall(RE_CMD_LEX, s):
            if word:
                pass   # most frequent
            elif esc:
                word = esc[1]
            elif white or pipe:
                if accu is not None:
                    args.append(accu)
                if pipe:
                    args.append(pipe)
                accu = None
                continue
            elif fail:
                raise ValueError("invalid or incomplete shell string")
            elif qs:
                word = qs.replace('\\"', '"').replace('\\\\', '\\')
                if platform == 0:
                    word = word.replace('""', '"')
            else:
                word = qss   # may be even empty; must be last
    
            accu = (accu or '') + word
    
        if accu is not None:
            args.append(accu)
    
        return args