I'm working with very large (45,000,000+ character) strings in VBA, and I need to remove superfluous whitespace.
One space (aka, ASCII Code 32) is okay but any sections with two or more consecutive spaces should be reduced to only one.
I found a similar question here, although that OP's definition of a "very long string" was only 39,000 characters. The accepted answer was a loop using Replace
:
Function MyTrim(s As String) As String
Do While InStr(s, " ") > 0
s = Replace$(s, " ", " ")
Loop
MyTrim = Trim$(s)
End Function
I tried this method and it was "worked", but was painfully slow:
Len In: 44930886 Len Out: 35322469 Runtime: 247.6 seconds
Is there a faster way to remove whitespace from a "very large" string?
I suspect the performance problem is due to creating a very large number of large intermediate strings. So, any method that does things without creating intermediate strings or with much fewer would perform better.
A Regex replace has a good chance of that.
Option Explicit
Sub Test(ByVal text As String)
Static Regex As Object
If Regex Is Nothing Then
Set Regex = CreateObject("VBScript.RegExp")
Regex.Global = True
Regex.MultiLine = True
End If
Regex.Pattern = " +" ' space, one or more times
Dim result As String: result = Regex.Replace(text, " ")
Debug.Print Len(result), Left(result, 20)
End Sub
With an input string of 45 million characters takes about a second.
Runner:
Sub Main()
Const ForReading As Integer = 1
Const FormatUTF16 As Integer = -1 ' aka TriStateTrue
Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
Dim file As Object: Set file = fso.OpenTextFile("C:\ProgramData\test.txt", ForReading, False, FormatUTF16)
Dim text As String: text = file.ReadAll()
Set file = Nothing
Set fso = Nothing
Debug.Print Len(text), Left(text, 20)
Test (text)
End Sub
Test data creator (C#):
var substring = "××\n× ×× ";
var text = String.Join("", Enumerable.Repeat(substring, 45_000_000 / substring.Length));
var encoding = new UnicodeEncoding(false, false);
File.WriteAllText(@"C:\ProgramData\test.txt", text, encoding);
BTW—Since VBA (VB4, Java, JavaScript, C#, VB, …) uses UTF-16, the space character is the one UTF-16 code unit ChrW(32)
. (Any similarity to or comparison with ASCII, is unnecessary mental gymnastics, and if put into code as ANSI [Chr(32)
], unnecessary conversion behind the scenes, with different behavior for different machines, users and times.)