I've noticed a bit of odd behavior with the -replace command in PowerShell, and I'm just wondering if anyone out there can tell me what's going on.
I was writing some code to make sure that a new username entered in a textbox only contained numbers, letters (which were previously made lower case), dots, dashes, and underscores. The method I used was to copy the entered text into another variable, but use -replace to strip off everything valid, and if anything was left behind use another -replace to remove that from the textbox.
However, when I tried the code as:
$textboxNewUserName.Text -replace "[0-9a-z.-_]", ""
I found that I could enter whatever I wanted to because this statement would always return nothing.
On the other hand, this code works fine:
$textboxNewUserName.Text -replace "[0-9a-z._-]", ""
Can anyone tell me why the placement of the underscore makes any difference?
To match a - (HYPHEN-MINUS, U+002D) character verbatim inside [...], a (positive) character-group regex expression, you must either place it at the very start or the very end of the group, or escape it as \-; otherwise, it is interpreted as a metacharacter that separates the endpoints of a range of characters, such as in a-z.
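As a minimal sketch of the placement rule above, the following compares three ways of keeping the hyphen literal inside a character group. The sample string and the variable names are made up for illustration and are not part of the original question.

```powershell
# Hypothetical sample input containing '-', '_' and '.'.
$sample = 'a-b_c.d'

$hyphenFirst   = $sample -replace '[-._]'    # '-' placed first: literal
$hyphenLast    = $sample -replace '[._-]'    # '-' placed last: literal
$hyphenEscaped = $sample -replace '[.\-_]'   # '-' escaped in the middle: literal

# Each variant strips '-', '_' and '.', so all three output 'abcd'.
$hyphenFirst, $hyphenLast, $hyphenEscaped
```

Escaping is arguably the most robust choice, because the group keeps working if someone later reorders its characters.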
Your second example, [0-9a-z._-], worked as expected: the - is at the very end.
By contrast, your first example, [0-9a-z.-_], didn't, because .-_ was interpreted as a range of characters with . and _ as the endpoints, which ended up matching more characters than you intended:
A range comprises the contiguous series of Unicode characters between the lower (first) and the higher (second) endpoint, inclusive, as determined by their Unicode code points.
The hex. code point of . is 0x2e and that of _ is 0x5f (decimal 46 and 95), meaning that the range comprises 50 characters; you can print them as follows:
[char[]] (([int] [char] '.') .. ([int] [char] '_'))
In PowerShell (Core) 7 it is now possible to use characters directly as the endpoints of a PowerShell range operation, .., so you can simplify to:
# PowerShell 7 only
'.'..'_'
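To make the effect of the accidental range concrete, here is a small sketch that applies the pattern from the question to two made-up inputs. The sample strings and variable names are illustrative assumptions. Note that the range covers characters such as /, :, and @, but not - itself (0x2d) or ! (0x21), both of which sit below the lower endpoint.

```powershell
# The mis-specified group from the question: '.-_' is read as a range.
$pattern = '[0-9a-z.-_]'

# Size of the range, computed from the code points of its endpoints.
$rangeSize = ([int] [char] '_') - ([int] [char] '.') + 1   # 50

# '/', ':' and '@' all fall inside the range, so nothing is left over.
$allRemoved = 'Bob/Smith:@home' -replace $pattern, ''      # ''

# '-' and '!' lie outside the range, so they survive.
$leftOver = 'a-b!c' -replace $pattern, ''                  # '-!'

$rangeSize, $allRemoved, $leftOver
```

So the statement did not return nothing for every possible input, but it did for most punctuation a user would plausibly type, while flagging the hyphen that was meant to be allowed.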
Two asides:
You can use a negative character group to achieve your result with a single -replace operation: removal of all characters other than permitted ones.
Using '...' (verbatim string literals) rather than "..." (expandable string literals) in regex contexts in PowerShell is a good habit to form, unless up-front string expansion (interpolation) is truly needed.
Using "..." can be especially tricky in the replacement expression of the -replace operation, where tokens such as $1 refer to capture-group results; if you used "$1", PowerShell would try to expand $1 as a PowerShell variable, (typically) resulting in the empty string.
See this answer for details.
In your case, use of the empty string as the replacement expression is optional, given that omitting such an expression implicitly uses the empty string, i.e. in effect removes what was matched from the input string.
Therefore:
# Removes all chars. OTHER than 0-9, a-z (case-insensitively), and ., _ and -
$textboxNewUserName.Text -replace '[^0-9a-z._-]'
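The quoting pitfall from the second aside can be sketched as follows. This assumes that no variables named 1 or 2 are defined in the session and that Set-StrictMode is off (under strict mode, referencing them would raise an error instead). The sample string and variable names are made up for illustration.

```powershell
# Hypothetical sample input with two words to swap.
$inputText = 'john smith'

# Single quotes: the regex engine receives '$2 $1' and substitutes the capture groups.
$swapped = $inputText -replace '(\w+) (\w+)', '$2 $1'      # 'smith john'

# Double quotes: PowerShell expands $2 and $1 as (undefined) variables first,
# so the regex engine only receives a single space as the replacement.
$expanded = $inputText -replace '(\w+) (\w+)', "$2 $1"     # ' '

# If an expandable string is truly needed, escape each $ with a backtick.
$escaped = $inputText -replace '(\w+) (\w+)', "`$2 `$1"    # 'smith john'

$swapped, $expanded, $escaped
```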