powershell abstract-syntax-tree canonicalization

Why are there so many StringConstantExpressionAst's in this PowerShell script?


I am canonicalizing a dataset of PowerShell scripts, and one processing step replaces all variables with X and all string literals with Y so that I can detect and remove near-duplicates.

However, I noticed that for many scripts the canonicalized output boils down to a lot of Y's and some X's with barely any other code. This is not what I anticipated, as there are only a handful of variables and string literals in the scripts.

To find all string literals I used the command:

$Strings = $AST.FindAll({$args[0] -is [System.Management.Automation.Language.StringConstantExpressionAst]}, $true)

To troubleshoot this I used ShowPSAst (PowerShell AST visualization tool) to visualize one sample script where the above problem was noticeable.

The original script looks like this:

Describe "Files" -Tag OSX,Linux {
    It "is utf-8 encoded" {
        $true | Should Be $false
    }
    It "uses Unix-style line endings" {
        $true | Should Be $false
    }
    It "has a shebang" {
        $true | Should Be $false
    }
}
Describe "Placeholder for Nano tests" -Tag Nano {
}

After canonicalization I obtain the following:

Y Y -Tag Y,Y {
    Y Y {
        X | Y Y X
    }
    Y Y {
        X | Y Y X
    }
    Y Y {
        X | Y Y X
    }
}
Y Y -Tag Y {
}

An excerpt of the AST visualization for the above script:

[Image: part of the AST visualization of the above script]

Note that the highlighted part in the right panel of the image corresponds to the CommandAst node in the left panel, which has many StringConstantExpressionAst nodes as children. Looking at these AST nodes, it makes sense why there are so many Y's in my canonical version. What confuses me is why nearly all of the individual tokens in the highlighted code are treated as StringConstantExpressionAst. I would expect only "Placeholder for Nano tests" to be treated as a string literal.

To be precise, I would expect

Describe "Placeholder for Nano tests" -Tag Nano

to be transformed into

Describe Y -Tag Nano

and NOT into

Y Y -Tag Y

I don't really use PowerShell on my own and don't know its intricacies, so I apologize if I'm missing something basic and I am thankful in advance for any help in understanding this PowerShell behavior.


Solution

  • PowerShell is an interpreted language, which means it doesn't attach a meaning to some parts of your code until you run it. In your case, it doesn't know that the word "Describe" is referring to the Describe function in the Pester module (which might not even be imported into your session yet), and it could equally mean an external program called "Describe.exe" for example.

    All the parser does is make a note of the name of the command as a StringConstantExpressionAst, and it's up to the runtime logic to look for something to run that has that name.

    If you look closely at your AST you'll see that the "Describe" token has a StringConstantType property of BareWord, whereas a quoted string such as "Placeholder for Nano tests" has a value of DoubleQuoted. If you only want to process "literal strings", you could use the StringConstantType property as a filter.

    $Strings = $AST.FindAll(
        {
            ( $args[0] -is [System.Management.Automation.Language.StringConstantExpressionAst] ) -and
            ( $args[0].StringConstantType -ne "BareWord" )
        },
        $true
    )
    

    Except then you might miss unquoted strings in things like:

    Describe Files -Tag OSX,Linux {
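
    To see why, here is a minimal sketch that parses that line and lists every StringConstantExpressionAst together with its StringConstantType. The sample text, the closing "}" added so that it parses cleanly, and the variable names are assumptions made for illustration.

    ```powershell
    # Sample line from above, closed with "}" so that it parses cleanly (assumption).
    $code = 'Describe Files -Tag OSX,Linux { }'

    # ParseInput also reports tokens and parse errors; they are not used here.
    $tokens = $null
    $errors = $null
    $AST = [System.Management.Automation.Language.Parser]::ParseInput($code, [ref]$tokens, [ref]$errors)

    # @(...) materializes the lazy sequence returned by FindAll into an array.
    $Strings = @($AST.FindAll(
        { $args[0] -is [System.Management.Automation.Language.StringConstantExpressionAst] },
        $true
    ))

    # Show each node's value next to the way it was written in the source.
    $Strings | Select-Object Value, StringConstantType
    ```

    This should list Describe, Files, OSX and Linux, all with a StringConstantType of BareWord, so a filter that drops bare words would skip the unquoted arguments along with the command name.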
    

    So another (better?) option might be to just ignore the first child element in any CommandAst nodes instead.
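
    A minimal sketch of that second option, assuming the command name is always the first entry in the CommandElements collection of its parent CommandAst. The sample text and variable names are illustrative and not taken from your dataset.

    ```powershell
    # Sample line from the question, closed with "}" so that it parses cleanly (assumption).
    $code = 'Describe "Placeholder for Nano tests" -Tag Nano { }'

    # ParseInput also reports tokens and parse errors; they are not used here.
    $tokens = $null
    $errors = $null
    $AST = [System.Management.Automation.Language.Parser]::ParseInput($code, [ref]$tokens, [ref]$errors)

    # Keep string constants unless they are the first element (the command name)
    # of the CommandAst that directly contains them.
    $Strings = @($AST.FindAll(
        {
            $node = $args[0]
            ($node -is [System.Management.Automation.Language.StringConstantExpressionAst]) -and
            -not (
                ($node.Parent -is [System.Management.Automation.Language.CommandAst]) -and
                [object]::ReferenceEquals($node.Parent.CommandElements[0], $node)
            )
        },
        $true
    ))

    # Print the value of each string node that survived the filter.
    $Strings | ForEach-Object { $_.Value }
    ```

    With this filter the command name Describe is left alone, while both "Placeholder for Nano tests" and the bare word Nano are still reported as strings. If bare-word arguments such as Nano should stay untouched as well, the StringConstantType check shown earlier would be needed instead of, or in addition to, this one.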