performancef#sequencedistincttruncate

Can `Seq.truncate` after `Seq.distinct` in F# be beneficial in terms of performance


In the solution of the Pangram exercise:

let totalAlphabetCount = 1 + int ('Z' - 'A')

let isPangram (input: string): bool =
    input
    |> Seq.filter System.Char.IsAsciiLetter
    |> Seq.map System.Char.ToUpper
    |> Seq.distinct
    |> Seq.truncate totalAlphabetCount
    |> Seq.length = totalAlphabetCount

Does Seq.truncate totalAlphabetCount improve performance in huge input? For example, we have a 1Gb string with all ASCII letters in the first hundred characters.


Solution

  • The answer is yes. I ran an empirical test on my computer, with the following results:

    This is a difference of several orders of magnitude!

    The reason for this is that Seq.distinct is a lazy function. With the call to Seq.truncate in place, Seq.distinct will stop iterating once it finds 26 distinct characters, which happens in the first 100 characters of the input. Without the call to Seq.truncate, Seq.distinct must continue looking for distinct characters through the entire string, even though we know that it's not possible that others exist.

    For clarity, here's the implementation of Seq.distinct to show its lazy behavior:

        let distinct source =
            checkNonNull "source" source
    
            seq {
                let hashSet = HashSet<'T>(HashIdentity.Structural<'T>)
    
                for v in source do
                    if hashSet.Add v then
                        yield v
            }
    

    Test code:

    let rng = Random(0)
    
    let randomChar _ =
        rng.Next(32, 128)
            |> char
    
    let randomString len =
        Array.init len randomChar
            |> String
    
    let str =
        randomString 1_000_000_00
    
    let test () =
        isPangram str |> ignore
    
    test ()   // warmup
    
    Seq.init 10 (fun _ ->
        let stopwatch = Stopwatch.StartNew()
        test ()
        stopwatch.Elapsed.TotalMilliseconds)
        |> Seq.average
        |> printfn "%A"