python string-parsing

Splitting on spaces, except between certain characters


I am parsing a file that has lines such as

type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")

And I want to split this into separate fields.

In my example, there are four fields: type, title, pages, and comments.

The desired result after splitting is

['type("book")', 'title("golden apples")', 'pages(10-35 70 200-234)', 'comments("good read")']

A simple string split won't work, because it splits at every space. I want to split on spaces but preserve anything between parentheses or quotation marks.
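
To make the problem concrete, here is a minimal sketch of what a plain split does to the sample line (the variable names `line` and `naive_fields` are only for illustration):

```python
# Sample line from the file (variable names are illustrative).
line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

# str.split() with no arguments splits on every run of whitespace,
# so the quoted title, the page ranges, and the comment are broken apart.
naive_fields = line.split()
print(naive_fields)
print(len(naive_fields))  # 8 pieces instead of the 4 fields wanted
```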

How can I split this?


Solution

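  • This regex should work for you: \s+(?=[^()]*(?:\(|$))
    It matches a run of whitespace only when the next parenthesis after it is an opening one, or when no parentheses remain, so whitespace inside a pair of parentheses is never a split point. Quoted text is preserved because, in your sample, it always sits inside parentheses.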

    import re
    result = re.split(r"\s+(?=[^()]*(?:\(|$))", subject)
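
As a quick check, the pattern can be run against the sample line from the question. This is a minimal sketch, not a complete parser: `FIELD_SEPARATOR` and `split_fields` are hypothetical names, and it assumes that parentheses are balanced, are not nested, and never appear inside the quoted text.

```python
import re

# Whitespace is a separator only if the next parenthesis after it is an
# opening one, or if no parentheses remain before the end of the string.
# Assumption: parentheses are balanced, not nested, and never appear inside quotes.
FIELD_SEPARATOR = re.compile(r"\s+(?=[^()]*(?:\(|$))")


def split_fields(line):
    """Split a line into fields, keeping spaces inside parentheses (hypothetical helper)."""
    # strip() avoids empty strings caused by leading or trailing whitespace,
    # such as the newline at the end of a line read from a file.
    return FIELD_SEPARATOR.split(line.strip())


subject = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'
fields = split_fields(subject)
for field in fields:
    print(field)
```

Each of the four fields is printed on its own line. If the input can contain nested parentheses, or parentheses inside quoted strings, the lookahead can no longer tell inside from outside, and a small hand-written tokenizer would probably be a safer choice.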
    

    Explanation

    r"""
    \s+            # Match one or more “whitespace characters” (spaces, tabs, and line breaks), as many as possible (greedy)
    (?=            # Assert that the regex below can be matched, starting at this position (positive lookahead)
       [^()]*         # Match any number of characters NOT present in the list “()”, as many as possible (greedy)
       (?:            # Non-capturing group: match either
          \(             # the character “(” literally
       |              # or
          $              # the end of the string (or the position just before a trailing line break)
       )
    )
    """