excelvbaperformancetrimremoving-whitespace

Quickly remove unnecessary whitespace from a (very large) string


I'm working with very large (45,000,000+ character) strings in VBA, and I need to remove superfluous whitespace.

One space (aka, ASCII Code 32) is okay but any sections with two or more consecutive spaces should be reduced to only one.

I found a similar question here, although that OP's definition of a "very long string" was only 39,000 characters. The accepted answer was a loop using Replace:

Function MyTrim(s As String) As String
    Do While InStr(s, "  ") > 0
        s = Replace$(s, "  ", " ")
    Loop
    MyTrim = Trim$(s)
End Function

I tried this method and it was "worked", but was painfully slow:

Len In:  44930886 
Len Out: 35322469
Runtime: 247.6 seconds

Is there a faster way to remove whitespace from a "very large" string?


Solution

  • I suspect the performance problem is due to creating a very large number of large intermediate strings. So, any method that does things without creating intermediate strings or with much fewer would perform better.

    A Regex replace has a good chance of that.

    Option Explicit
    
    Sub Test(ByVal text As String)
    
      Static Regex As Object
      If Regex Is Nothing Then
        Set Regex = CreateObject("VBScript.RegExp")
        Regex.Global = True
        Regex.MultiLine = True
      End If
    
      Regex.Pattern = " +" ' space, one or more times
    
      Dim result As String: result = Regex.Replace(text, " ")
      Debug.Print Len(result), Left(result, 20)
    End Sub
    

    With an input string of 45 million characters takes about a second.

    Runner:

    Sub Main()
    
      Const ForReading As Integer = 1
      Const FormatUTF16 As Integer = -1 ' aka TriStateTrue
      Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
      Dim file As Object: Set file = fso.OpenTextFile("C:\ProgramData\test.txt", ForReading, False, FormatUTF16)
      Dim text As String: text = file.ReadAll()
      Set file = Nothing
      Set fso = Nothing
      Debug.Print Len(text), Left(text, 20)
    
      Test (text)
    
    End Sub
    

    Test data creator (C#):

    var substring = "××\n× ××   ";
    var text = String.Join("", Enumerable.Repeat(substring, 45_000_000 / substring.Length));
    var encoding = new UnicodeEncoding(false, false);
    File.WriteAllText(@"C:\ProgramData\test.txt", text, encoding);
    

    BTW—Since VBA (VB4, Java, JavaScript, C#, VB, …) uses UTF-16, the space character is the one UTF-16 code unit ChrW(32). (Any similarity to or comparison with ASCII, is unnecessary mental gymnastics, and if put into code as ANSI [Chr(32)], unnecessary conversion behind the scenes, with different behavior for different machines, users and times.)