python list parsing split brackets

Splitting a mathematical string into a list while keeping everything in matching brackets together


I am trying to get any mathematical string to split into a list by operators (i.e. "+", "-", "/", "*"), while keeping anything in a matching number of brackets together as one list element.

Here are a few examples with the desired outputs:

import math 

equation = "5+5*10"
equation_segmented = ["5", "+", "5", "*", "10"]

equation = "(2*2)-5*(math.sqrt(9)+2)"
equation_segmented = ["(2*2)", "-", "5", "*", "(math.sqrt(9)+2)"]

equation = "(((5-3)/2)*0.5)+((2*2))*(((math.log(5)+2)-2))"
equation_segmented = ["(((5-3)/2)*0.5)", "+", "((2*2))", "*", "(((math.log(5)+2)-2))"]

Note: letters (or symbols like "π") that appear inside brackets should stay within that bracketed element too.

My first thought was using a regex:

import re

equation_segmented = re.split(r"([+\-*/]|\(.*\))", equation)

The problem here, however, is that it does not account for matching brackets.
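
A quick check on the second example shows what goes wrong. This is only a sketch: the pattern is written as a raw string with a plain character class, which is an assumption about the intended regex. The greedy \(.*\) runs from the first opening bracket to the last closing one, so the whole string ends up as a single element:

```python
import re

equation = "(2*2)-5*(math.sqrt(9)+2)"

# Greedy \(.*\) matches from the first "(" to the last ")",
# so the whole equation is captured as one element.
parts = re.split(r"([+\-*/]|\(.*\))", equation)
print(parts)  # ['', '(2*2)-5*(math.sqrt(9)+2)', '']
```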

I then thought of iterating through the string manually and keeping track of the parentheses with a counter, but did not get it to work (I was pretty much only able to write my own 're.split' function).

Lastly, I went back to regex (equation_segmented = re.split(r"([+\-*/])", equation)) and tried splitting the string by operators first, then using "".join() to merge the list elements inside matching brackets afterwards, but again to no avail.

I suspect this might be a job for a parser, but I am not sure where to start.


Solution

  • A custom (non-regex) function is straightforward: all you need to do is keep track of opening and closing parentheses.

    Assuming the formula strings are syntactically correct:

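    OPS = set("+-*/")
    PMAP = {"(": 1, ")": -1}  # change in nesting depth per bracket
    
    
    def tokenizer(s: str) -> list[str]:
        result = [""]  # the last element is the token being built
        pcount = 0  # current bracket nesting depth
    
        for c in s:
            pcount += PMAP.get(c, 0)
            # Split only on operators that sit outside all brackets.
            if c in OPS and pcount == 0:
                result.extend([c, ""])
                continue
            result[-1] += c
    
        return result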
    
    
    equations = [
        "5+5*10",
        "(2*2)-5*(math.sqrt(9)+2)",
        "(((5-3)/2)*0.5)+((2*2))*(((math.log(5)+2)-2))",
    ]
    
    for eq in equations:
        print(tokenizer(eq))
    

    Output:

    ['5', '+', '5', '*', '10']
    ['(2*2)', '-', '5', '*', '(math.sqrt(9)+2)']
    ['(((5-3)/2)*0.5)', '+', '((2*2))', '*', '(((math.log(5)+2)-2))']
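
If the input cannot be assumed to be syntactically correct, one possible variation is to check the bracket balance while splitting. The following is a minimal sketch under that assumption; the name tokenize_checked and the choice of ValueError are illustrative, not an established API:

```python
OPS = set("+-*/")


def tokenize_checked(s: str) -> list[str]:
    """Split on top-level operators; raise ValueError on unbalanced brackets."""
    result = [""]
    depth = 0  # current bracket nesting depth

    for c in s:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("closing bracket without a matching opening bracket")
        if c in OPS and depth == 0:
            # Top-level operator: emit it and start a new, empty operand.
            result.extend([c, ""])
        else:
            result[-1] += c

    if depth != 0:
        raise ValueError("opening bracket without a matching closing bracket")
    return result


print(tokenize_checked("(2*2)-5*(math.sqrt(9)+2)"))
# ['(2*2)', '-', '5', '*', '(math.sqrt(9)+2)']
```

Like the function above, this treats every top-level "+", "-", "*" or "/" as a binary operator, so a leading unary minus (as in "-5+3") produces an empty first element.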