(sorry for the long post, to skip directly to the question(s), see bottom)
(UPDATE: if you are revisiting, please see sections marked "update" ;)
I set myself to understand better what was going on under the hood with F# sequences. A task I needed to optimized involved converting strings into a sequence of Unicode codepoints and I was wondering if I could replace the mutable loop we were using into an immutable one without sacrificing too much performance.
One of the challenges is that the returned sequence does not have the same length as the input sequence, because of surrogate pairs that together return one integer. This was the original code looking like:
let stocp0(value: string) : seq<int> =
let pos = ref 0
seq {
while !position < value.Length do
let c = System.Char.ConvertToUtf32(value, !pos)
pos := !position + if c >= 0x10000 then 2 else 1
yield c
}
for-do
I figured that the easiest thing to do was to turn it into a for-do loop (not a for-in-do loop, they have too much extra overhead):
let inline stocp1(value: string) =
seq {
for i = 0 to value.Length - 1 do
if(not <| Char.IsLowSurrogate(value.[i]))
then yield Char.ConvertToUtf32(value, i)
}
This performed 3.2 times slower than the mutable counterpart above. A higher factor than I had imagined.
Seq.mapi
Since a string is already a sequence (ok, there's an IEnumerable<char>
wrapper) I thought to utilize that with existing sequence functions from the Seq
module, hoping that would perhaps bring better performance:
let inline stocp2(value: string) =
value
|> Seq.mapi (fun i c ->
if(Char.IsLowSurrogate(c)) then 0
else Char.ConvertToUtf32(value, i))
|> Seq.filter ((<>) 0)
It performed 3.5 times slower.
Strangely, if I replace value
with value.AsEnumerable()
, it performs significantly faster than stocp1
: factor 3.0.
After several more tests it became clear to me that each |>
creates a new IEnumerable<T>
layer, with all the chaining operations involved (this can also be observed in the FSharp source code of Seq
). But the size of the overhead surprised me. Since none of the above had even remotely equal performance of the original, I decided to try to prevent the extra sequence overhead to kick in and create a Seq.mapiAndFilter
function to do both actions at once.
Seq.mapiAndFilter
Since it is such a delicate loop and I only need to filter on the current character and return based on the current position, perhaps I could remove the extra step involved with Seq.mapi
, which seems expensive.
To do that, I needed to mimic the behavior of the existing Seq.xxx
functions and my first attempt was to do that with a while-yield loop. This would be closest to the original mutable approach but adds one layer of IEnumerable<T>
overhead.
I wrote the following function, which takes a function that returns a boolean and if true, it applies the second function on the position of the current element.
let inline mapiAndFilter f g (e: seq<_>) : seq<_> =
let position = ref -1
let e = e.GetEnumerator()
seq {
while e.MoveNext() do
position := !position + 1
if f e.Current then yield g !position
}
// and the application of it:
let inline stocp3(value: string) =
value.AsEnumerable()
|> mapiAndFilter (not << Char.IsLowSurrogate) (fun i -> Char.ConvertToUtf32 (value, i))
The result was much better than previous attempts, it clocked in at 1.5 times the performance of the mutable solution. Still disappointingly slow, though, it seemed to imply that the added overhead with the enumerators is about 50% in tight loops.
Seq.mapiAndFilter
To find out what was going on under the hood, I then decided to write the enumerable type explicitly, which should give me the opportunity to find out whether any boilerplate checks added in the FSharp libraries had something to do with the low performance characteristics.
Without the safe-guards FSharp Seq
functions use internally (to raise an error on illegal usage of Current etc), I came up with this:
let mapiAndFilter f g (e : seq<'a>) : seq<'b> =
let i = ref -1
let e = e.GetEnumerator()
let inline getEnum() = {
new IEnumerator<'b> with
member x.Current = g !i
interface System.Collections.IEnumerator with
member x.Current = box <| g !i
member x.MoveNext() =
let rec next() =
i := !i + 1
e.MoveNext() && (f e.Current || next())
next()
member x.Reset() = noReset()
interface System.IDisposable with
member x.Dispose() = e.Dispose()
}
{
new System.Collections.Generic.IEnumerable<'b> with
member x.GetEnumerator() = getEnum()
interface System.Collections.IEnumerable with
member x.GetEnumerator() = getEnum() :> System.Collections.IEnumerator
}
// apply the same way as before:
let inline stocp4(value: string) =
value.AsEnumerable()
|> mapiAndFilter (not << Char.IsLowSurrogate) (fun i -> Char.ConvertToUtf32 (value, i))
This became our current winner! It appeared to clock in at only 1.1 times slower than the original mutable function. Of course, it uses mutable state, but so do all the Seq.xxx
functions internally anyway.
General note on all attempts above: I have also tested with ToCharArray()
, which improves performance on small-to-medium input, but becomes detrimental on large input strings, esp. when not all items need to be enumerated. Many other approaches I left out, because their performance was so much worse (Seq.choose
over Seq.filter
is much slower, Seq.collect
, very slow etc).
I used the following for performance comparison (apparently, Seq.length
is the fastest way to force-iterate, Seq.last
and Seq.iter
are much slower):
let input = "ab\U0001ABCDcde\U0001ABCEfghi\U0001ABCF"
let run f = for i in 1 .. 1000000 do f input |> Seq.length |> ignore;;
run stocp1
// etc
Results:
Function CPU Factor
stocp0 0.296 1.0
stocp1 0.951 3.2
stocp2 1.029 3.5
stocp2' 0.873 3.0
stocp3 0.436 1.5
stocp4 0.327 1.1
stocp5 0.405 1.3 (latkin's answer, adj. with Array.toSeq)
stocp'
is the version that uses AsEnumerable()
on the string prior to passing it to the Seq.xxx
functions. All other functions already use this.
I also tested with longer and with very large (50MB) strings, which is our typical use-case, and while the timings are less steady on subsequent runs, the effective factors are about the same as above.
Update: I added latkin's answer as stocp5
, but had to adjust by adding a Array.toSeq
to it. Without it, it clocks in at 0.234
, which is faster than the original while-loop. Unfortunately, I require a sequence (we must use lazy loading and cannot hold whole strings in memory).
The above comparison only tests iteration, which helps find the issues caused by the stacked iterators. However, the timings are a little different if you add element access to the equation. I enforced it by an added Seq.map id
:
let runmap f = for i in 1 .. 1000000 do f input |> Seq.map id |> Seq.length |> ignore;;
Results:
Function CPU Factor
stocp0 0.795 1.0
stocp1 1.528 1.9
stocp2 1.731 2.2
stocp2' 1.812 2.3
stocp3 0.936 1.2
stocp4 0.842 1.1
stocp5 0.873 1.1 (credit: latkin, see his answer and notes above)
Since our typical use-cases do not require full iteration, I've added a test that just iterates until the 2nd surrogate pair at position 6, with larger sized input (3932160 characters).
let runmapnth f = for i in 1 .. 1000000 do f input |> Seq.map id |> Seq.nth 6 |> ignore;;
Results:
Function CPU Factor
stocp0 0.624 1.0
stocp1 1.029 1.6
stocp2 1.263 2.0
stocp2' 1.107 1.8
stocp3 0.717 1.1
stocp4 0.624 1.0
stocp5 --- --- OOM
The OutOfMemoryException
with latkin's answer surprised me a bit, it means that the created arrays were not cleaned up when used in a tight loop as above. My machine allocated 8GB a few times in a few seconds, and drops (GC'ed?) in-between, but in the end still fails. Strange:
The other performance characteristics are as can be expected based on earlier observations.
With the last exercise above, I found out something I hadn't expected: the F# compiler only calls the non-generic IEnumerator.Current
and never calls the IEnumerator<T>.Current
. This may explain in part why the performance degradation with chained sequence filters is so noticeable when the object you are performing it on is a value type: the boxing places it on the heap and back again, which is terrible.
Why is the compiler not using the generic interface?
How come that the for-loop is so slow, what happens internally here? Isn't it supposed to turn into a tail-call, which is then compiled into a fast loop internally?
Is there a more natural or other way of writing a filter like I did (mapi, then filter) that does not have the drawbacks of the detrimental performance I described?
Why is there such a big difference between piping the string directly (slow) and string.AsEnumerable()
(fast)?
I have many more questions, but the SO-format generally wants you to ask a simple single question only, which I clearly didn't. Sorry to have been so elaborate, I hope I doesn't put too many people off to come with some insightful observations.
UPDATE: as pointed out in the comments, the boxing seems to only appear when run from FSharp Interactive (FSI). If you take stocp4
and change the calling code by adding a redundant Seq.filter ((<>) 0)
(or something similar), it will instead call the unboxed accessor. Why? No idea.
Ok I'll take a shot. All code and benchmark results can be found here.
Lazy v Eager Seqs are slow. Comprehensions are slow. They are a convenient abstraction that involves a lot of compiler-generated goop and allocations, and should generally just be avoided altogether if perf is important. All of the impls in question are handily beat by below simple non-lazy solution.
// ~50% faster for given test case
// still ~20% faster even for length 1.5M string
let eager1 (value: string) =
let result = ResizeArray(value.Length)
for i in 0 .. value.Length - 1 do
if not (Char.IsLowSurrogate(value.[i]))
then result.Add(Char.ConvertToUtf32(value, i))
result.ToArray()
Generic v Non Your generic code is being invoked in the benchmark function.
Add a logging statement to both .Current
impls, and pipe your output sequence to |> Seq.iter (printfn "%d")
, and you will see it's the generic one that's invoked.
Were you testing in FSI? For whatever reason, FSI's "print a few elements of this sequence to the console" code does wind up in the non-generic path, but that doesn't affect the executing code. Maybe that's what you were seeing?
Loops in seq{ } Loops inside of seq { }
and other computation expressions are not regular loops. (in fact pretty much nothing "normal looking" inside of computation expressions is actually normal, which is kind of the point :)) As indicated in the computation expression docs, a for
loop ends up codgening as an iteration over another enumerable. while
loops are a bit simpler.
This more or less explains why your "attempt 1" is so much slower - the for
loop results in allocation and iteration of yet another seq inside of your seq.
Piping through Seq APIs Yes, this will create new seqs at each step. If the "real work" involved is very tiny like this example, then overhead starts to dominate.
Getting faster Your subsequent optimizations each remove layers of abstraction, and so although I don't have precise explanations, it seems reasonable that they get a bit faster.
.AsEnumerable() That's pretty wacky, I can repro the significant speedup that you see. Very strange given that the AsEnumerable
extension method does nothing but return its input directly!
The structure of the generated code in these case is very different. Maybe this is a pathological case in the optimizer somehow. Interesting find.
Variations I found that results vary quite significantly when you enable/disable optimizations, and when you target x64 vs x86. Take that for what it's worth.
Update after changed benchmarks and requirements from OP
Array.toSeq It is unnecessary to use Array.toSeq
here, and will predictably drag down performance of my suggested solution. Array.toSeq
and Seq.ofArray
are there more for safety (make sure resulting seq can't be converted back to array by consumer and mutated), than type conversion.
Better choices:
seq<_>
when returning it#seq<'t>
, then even a plain array is okUpdated Requirements Given newly-revealed constraints:
then it's clear a lazy approach is going to be more appropriate, at least in some cases.
Yet even with those requirements, in my testing with your new benchmarks, non-lazy solutions still perform very well in all cases except OOM or huge input with early bailout.
See my gist linked above for results. It includes alternative non-lazy implementations:
let eager2 (value: string) =
let result = ResizeArray(value.Length)
for i in 0 .. value.Length - 1 do
if not (Char.IsLowSurrogate(value.[i]))
then result.Add(Char.ConvertToUtf32(value, i))
// cast result so that return type isn't array
(result.ToArray()) :> seq<_>
let eager3 (value: string) =
let result = ResizeArray(value.Length)
for i in 0 .. value.Length - 1 do
if not (Char.IsLowSurrogate(value.[i]))
then result.Add(Char.ConvertToUtf32(value, i))
// ToArray() causes another copy to be generated.
// Avoiding that is a win in large-input scenarios, but at a cost
// of otherwise slower processing
(result) :> seq<_>
Improving lazy solution
Here is a further optimization of the lazy approach, directly integrating all logic, avoiding use of the string enumerator, and avoiding recursion.
This guy actually seems to beat the non-lazy solutions in most cases!
let lazy5 (value : string) =
let inline getEnum() =
let i = ref -1
{ new IEnumerator<int> with
member __.Current = Char.ConvertToUtf32(value, !i)
interface System.Collections.IEnumerator with
member __.Current = box (Char.ConvertToUtf32(value, !i))
member __.MoveNext() =
incr i
if !i >= value.Length then false else
if not (Char.IsLowSurrogate(value.[!i])) then true else
incr i
!i < value.Length
member __.Reset() = failwith "reset"
interface IDisposable with
member __.Dispose() = () }
{ new IEnumerable<int> with
member __.GetEnumerator() = getEnum()
interface IEnumerable with
member __.GetEnumerator() = getEnum() :> IEnumerator }
Summary
The first while-based seq solution looks great and performs well given constraints. I've tried to give some context on why proposed alternatives might be slower, hopefully that's helpful. I managed to squeeze out a bit more perf by integrating everything into an explict IEnumerable
directly.
Depending on constraints and input, a non-lazy solution might be a good choice. I've suggested a few options here. As always, you will need to test in your real environment.