In the solution of the Pangram exercise:
let totalAlphabetCount = 1 + int ('Z' - 'A')
let isPangram (input: string): bool =
input
|> Seq.filter System.Char.IsAsciiLetter
|> Seq.map System.Char.ToUpper
|> Seq.distinct
|> Seq.truncate totalAlphabetCount
|> Seq.length = totalAlphabetCount
Does Seq.truncate totalAlphabetCount
improve performance in huge input?
For example, we have a 1Gb string with all ASCII letters in the first hundred characters.
The answer is yes. I ran an empirical test on my computer, with the following results:
This is a difference of several orders of magnitude!
The reason for this is that Seq.distinct
is a lazy function.
With the call to Seq.truncate
in place, Seq.distinct
will stop iterating once it finds 26 distinct characters, which happens in the first 100 characters of the input. Without the call to Seq.truncate
, Seq.distinct
must continue looking for distinct characters through the entire string, even though we know that it's not possible that others exist.
For clarity, here's the implementation of Seq.distinct
to show its lazy behavior:
let distinct source =
checkNonNull "source" source
seq {
let hashSet = HashSet<'T>(HashIdentity.Structural<'T>)
for v in source do
if hashSet.Add v then
yield v
}
Test code:
let rng = Random(0)
let randomChar _ =
rng.Next(32, 128)
|> char
let randomString len =
Array.init len randomChar
|> String
let str =
randomString 1_000_000_00
let test () =
isPangram str |> ignore
test () // warmup
Seq.init 10 (fun _ ->
let stopwatch = Stopwatch.StartNew()
test ()
stopwatch.Elapsed.TotalMilliseconds)
|> Seq.average
|> printfn "%A"