[SOLVED] Why is pyparsing removing whitespace?

I am using pyparsing to parse sass files. I need to get the sass files into a particular format to work with them, and preserving whitespace is very important.

import pyparsing as pp

pp.ParserElement.set_default_whitespace_chars('')

def parse(input_str):
    nested_exp = pp.nestedExpr('{','}').parseString("{"+input_str+"}").asList()
    return nested_exp

input_str = """selector
{
  a:b;
  c:d;
  selector
  {
    a:b;
    c:d;
  }
  y:z;
}"""

my_output = parse(input_str)

Expected Output: 
[['selector\n',['\n  a:b;\n  c:d;\n  selector\n',['\n    a:b;\n    c:d;\n'],'\n  y:z;\n']]]
My output: 
[['selector', ['a:b;\n  c:d;\n  selector', ['a:b;\n    c:d;'], 'y:z;']]]

Notice there should be 2 spaces before the 'a:b;' and 'y:z;'. Why did pyparsing remove them even though I used set_default_whitespace_chars('')? Some whitespace is removed while other whitespace is kept.

Solution

nested_expr (nestedExpr is its older camelCase name) makes a number of assumptions about the content you want back, most notably that the items making up the 'content' of the expression are whitespace-separated values. So from this:

pp.nestedExpr().parse_string("(1 2 3( 4 5) 'xyz()')").as_list()

we get this:

[['1', '2', '3', ['4', '5'], "'xyz()'"]]

There are a few parameters you can pass to customize this, such as ignore_expr (defaults to a quoted_string expression, so that parentheses inside quotes are not accidentally parsed as nesting delimiters). You can also be more specific about the content that is supposed to be inside the nested expression, like this example that gets nested lists of ints:

nested_ints = pp.nested_expr(content=pp.DelimitedList(pp.common.integer))
result = nested_ints.parse_string("(1,2(3,4,5)(6)7)")
print(result.as_list())

[[1, 2, [3, 4, 5], [6], 7]]

Thanks for extracting this down to an easy-to-reproduce example. Unfortunately, how you go about addressing this depends on what you are trying to do with this nested list.

If you just want to detect a nested list and keep it intact, you can wrap the expression in the original_text_for helper, like this:

    nested_exp = pp.original_text_for(pp.nestedExpr('{','}')).parseString("{"+input_str+"}").asList()

This just returns the original text as a string:

['{selector\n{\n  a:b;\n  c:d;\n  selector\n  {\n    a:b;\n    c:d;\n  }\n  y:z;\n}}']

You can see that this preserves all the internal whitespace, though this does not match the expected list that you posted.

Unfortunately, the current implementation of nested_expr does some internal whitespace stripping, even if you have set the default whitespace characters to ''. I'll look at making this more aware of the defaults in the next release. For now, you'll probably need to roll your own nested_expr to preserve whitespace, something like this:

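# Relies on pp.ParserElement.set_default_whitespace_chars('') having been called
# first (as in the question), so DEFAULT_WHITE_CHARS is '' below.
opener = "{"
closer = "}"
ignoreExpr = pp.quoted_string().leave_whitespace()
# content: a run of characters that are not delimiters and do not start a quoted string
content = pp.Combine(
    pp.OneOrMore(
        ~ignoreExpr
        + pp.CharsNotIn(
            opener + closer + pp.ParserElement.DEFAULT_WHITE_CHARS,
            exact=1,
        )
    )
)
nested_expr = pp.Forward()
# ignoreExpr is listed as an alternative so quoted strings are consumed as
# their own tokens instead of stopping the parse
nested_expr <<= pp.Group(
    pp.Suppress(opener) + (ignoreExpr | nested_expr | content)[...] + pp.Suppress(closer)
)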

def parse(input_str):
    nested_exp = nested_expr.parseString("{"+input_str+"}").asList()
    return nested_exp
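
As a quick check, here is a self-contained sketch of the same hand-rolled grammar run against a smaller input than the one in the question. The names nested and sample are introduced only for this illustration, and the quoted-string expression is listed as an alternative inside the group so that quoted text is consumed as its own token. Note that set_default_whitespace_chars('') changes global state, so it has to run before any expression is created.

```python
import pyparsing as pp

# Global setting: must be applied before the expressions below are built.
pp.ParserElement.set_default_whitespace_chars("")

opener, closer = "{", "}"
ignore_expr = pp.quoted_string().leave_whitespace()

# One or more single characters that are neither a delimiter nor the start
# of a quoted string, joined into one token. Whitespace is ordinary content here.
content = pp.Combine(
    pp.OneOrMore(~ignore_expr + pp.CharsNotIn(opener + closer, exact=1))
)

nested = pp.Forward()
nested <<= pp.Group(
    pp.Suppress(opener) + (ignore_expr | nested | content)[...] + pp.Suppress(closer)
)

sample = "{a\n{\n  b;\n}\n}"
result = nested.parse_string(sample).as_list()
print(result)  # [['a\n', ['\n  b;\n'], '\n']]
```

Every character between the braces survives, including the newlines and the indentation. One thing to be aware of with this grammar: indentation that sits directly before a nested '{' or a closing '}' stays attached to the end of the preceding string, so on the input from the question the strings may carry a few more trailing spaces than the expected output that was posted.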

UPDATE: The nested_expr code in pyparsing has been fixed, and this example was used to create a unit test. The fix is to be released in pyparsing 3.2.2.
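
If your code has to run against both older and newer releases, one option is to check the installed version before deciding whether the workaround is needed. This is only a sketch: version_tuple and HAS_WHITESPACE_FIX are hypothetical names, and the 3.2.2 threshold is taken from the update above rather than checked against the release notes.

```python
import re

import pyparsing as pp


def version_tuple(version):
    """Return up to the first three numeric fields of a version string as ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


# Assumption: 3.2.2 is the first release that contains the fix.
HAS_WHITESPACE_FIX = version_tuple(pp.__version__) >= (3, 2, 2)

if HAS_WHITESPACE_FIX:
    print("pyparsing", pp.__version__, "- try the built-in nested_expr first")
else:
    print("pyparsing", pp.__version__, "- use the hand-rolled grammar")
```

Whichever branch is taken, it is worth running your own input through the parser and comparing the result with the list you expect, since the exact token boundaries may still differ from what you need.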