Given an unknown string of unknown size, e.g. a ScriptBlock expression or something like:
$Text = @'
LOREM IPSUM
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
'@
I would like to collapse the string into a single line (replacing each run of consecutive whitespace with a single space) and truncate it to a specific $Length:
$Length = 32
$Text = $Text -Replace '\s+', ' '
if ($Text.Length -gt $Length) { $Text = $Text.SubString(0, $Length) }
$Text
LOREM IPSUM Lorem Ipsum is simpl
The issue is that this is inefficient for a large string: it replaces the whitespace in the whole $Text string, whereas I only need to replace the first few runs of whitespace, until I have a string of the required size ($Length = 32).
Swapping the -replace and SubString operations isn't an option either, as that would return a string shorter than required, or even just a single space for any $Text string that starts with something like 32 whitespace characters.
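To make the pitfall concrete, here is a minimal sketch with a made-up input of 32 leading spaces (the variable names $Swapped and $Intended are only illustrative):

```powershell
$Length = 32
# Hypothetical input: 32 leading spaces followed by some text
$Text = (' ' * 32) + 'LOREM IPSUM'

# Swapped order: truncate first, then collapse the whitespace
$Swapped = $Text.Substring(0, $Length) -replace '\s+', ' '

# Intended order: collapse first, then truncate
$Intended = $Text -replace '\s+', ' '
if ($Intended.Length -gt $Length) { $Intended = $Intended.Substring(0, $Length) }

"[$Swapped]"    # [ ]
"[$Intended]"   # [ LOREM IPSUM]
```

The swapped version throws away everything after the leading whitespace, which is why it can't simply be used as the cheaper variant.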
Question:
How can I efficiently merge the two operations (-replace and SubString) so that I don't replace more whitespace than necessary and still get a string of the required length (when the $Text string is longer than the required length)?
Update
I think I am close by using a MatchEvaluator Delegate:
$Length = 8
$TotalSpaces = 0
$Delegate = {
    if ($Args[0].Index - $TotalSpaces -gt $Length) {
        '{break}'
        ([Ref]$TotalSpaces).Value = [int]::MaxValue
    }
    else { ([Ref]$TotalSpaces).Value += $Args[0].Value.Length }
}
[regex]::Replace('test 0 1 2 3 4 5 6 7 8 9', '\s+', $Delegate)
test01234{break}56789
Now the question is: how can I break out of the regex processing at the {break}?
Note that for performance reasons I really want to break out, and not substitute the <regular-expression> with the found match (which only makes it look like it stopped).
Perhaps a more manual approach is faster than trying to do it with regex, although it takes a lot more code.
$Text = @'
LOREM IPSUM
Lorem Ipsum is
simply dummy text
'@
$Length = 32
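$sb = [System.Text.StringBuilder]::new($Length)
$prevSpace = $false   # true while inside a run of whitespace
foreach ($char in $Text.GetEnumerator()) {
    # stop as soon as the required length is reached
    if ($sb.Length -eq $Length) {
        break
    }
    if ([char]::IsWhiteSpace($char)) {
        # emit one space for the first character of a whitespace run only
        if (-not $prevSpace) {
            $sb = $sb.Append(' ')
        }
        $prevSpace = $true
        continue
    }
    $sb = $sb.Append($char)
    $prevSpace = $false
}
$sb.ToString()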
A very similar approach using String.Create might be even faster, but it would need to be pre-compiled or loaded via Add-Type. You can find an example here.
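For comparison, the regex route from the question can also be made to stop early without a MatchEvaluator, by walking the matches one at a time with Match() and NextMatch() and leaving the loop once enough output has been collected. This is only a sketch and has not been benchmarked against the loop above; the function name Get-CollapsedPrefix is hypothetical.

```powershell
function Get-CollapsedPrefix {
    param([string]$Text, [int]$Length)

    $sb    = [System.Text.StringBuilder]::new($Length)
    $pos   = 0                               # start of the text not yet copied
    $match = [regex]::Match($Text, '\s+')    # finds only the first whitespace run

    while ($match.Success -and $sb.Length -lt $Length) {
        # copy the non-whitespace text before this run, capped at the space left
        $count = [Math]::Min($match.Index - $pos, $Length - $sb.Length)
        $null  = $sb.Append($Text, $pos, $count).Append(' ')
        $pos   = $match.Index + $match.Length
        $match = $match.NextMatch()          # the next run is searched only when needed
    }

    # no whitespace left: copy the tail, again capped at the space left
    if ($sb.Length -lt $Length -and $pos -lt $Text.Length) {
        $count = [Math]::Min($Text.Length - $pos, $Length - $sb.Length)
        $null  = $sb.Append($Text, $pos, $count)
    }

    # the last appended space may overshoot by one character
    if ($sb.Length -gt $Length) { $sb.Length = $Length }
    $sb.ToString()
}

Get-CollapsedPrefix 'test 0 1 2 3 4 5 6 7 8 9' 8   # test 0 1
```

Because each call to NextMatch() resumes scanning where the previous match ended, the rest of a large $Text is never visited once the required length has been reached.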