We figured that processing each "tail permutation" in parallel would decrease the real running time of this algorithm. Why does it not? How, if at all, can we configure PSeq to improve the running time?
let rec permute (xs:'t list) =
    if xs.Length = 1 then [xs]
    else ([], permute xs.Tail)
         // Why do we not get a performance
         // improvement when we use PSeq here
         // relative to using Seq?
         ||> PSeq.fold (fun acc ps ->
             (insertAtEach xs.Head ps)@acc)

permute ["A";"B";"C"] |> ignore
We figured that the permutation work would be split as follows, and that the insertAtEach phase of the algorithm would run in parallel.
                   [C]
        [BC]                [CB]
[ABC] [BAC] [BCA]    [ACB] [CAB] [CBA]
      CPU01                CPU02
It is actually slower in parallel, even when we use a larger initial list such as permute [0..9]. We have not figured out whether withDegreeOfParallelism and related PSeq options will help.
Here is the rest of the code listing:
let put a q xs =
    ([], xs)
    ||> Seq.fold (fun acc x ->
        if x = q then a::x::acc
        else x::acc)
    |> List.rev
// Insert a at each possible location in xs
let insertAtEach a xs =
    ([a::xs], xs)
    ||> Seq.fold (fun acc x ->
        (put a x xs)::acc)
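For concreteness, here is what insertAtEach produces for a small input (a sketch that simply repeats the two helpers exactly as defined above and prints the result):

```fsharp
// The two helpers from the listing above.
let put a q xs =
    ([], xs)
    ||> Seq.fold (fun acc x ->
        if x = q then a::x::acc
        else x::acc)
    |> List.rev

let insertAtEach a xs =
    ([a::xs], xs)
    ||> Seq.fold (fun acc x ->
        (put a x xs)::acc)

// Insert "A" at every position of the tail permutation ["B"; "C"]:
insertAtEach "A" ["B"; "C"]
|> List.iter (printfn "%A")
// prints ["B"; "C"; "A"], then ["B"; "A"; "C"], then ["A"; "B"; "C"]
```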
PSeq.fold doesn't do what you think it does. PSeq.fold is actually not parallel at all. It's sequential.
You can't just throw the word "parallel" somewhere in the middle of your algorithm and hope for the best. That's not how parallelization works. You have to actually understand what's going on, what happens in parallel, what is sequential, what can be parallelized in principle, and what cannot.
Take the fold, for example: it applies the provided folding function to every element of the sequence and to the result of the previous call. Since every next call must have the result of the previous call before it can execute, it's pretty obvious that fold cannot be carried out in parallel at all. It has to be sequential. And in fact, that's what PSeq.fold actually does if you look at its source code. So you pay some overhead for converting to a ParallelSequence every time, but get no gain.
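To make that data dependency explicit, here is a minimal hand-rolled left fold (a sketch; foldSeq is a hypothetical name, not part of any library): each recursive call consumes the accumulator that the previous step just produced, so no two steps can ever overlap.

```fsharp
// A left fold is a chain of dependent calls: step n cannot start
// until step n-1 has produced the accumulator it consumes.
let rec foldSeq folder state xs =
    match xs with
    | [] -> state
    | x :: rest ->
        // (folder state x) must finish before the recursive call can run
        foldSeq folder (folder state x) rest

foldSeq (+) 0 [1; 2; 3; 4] |> printfn "%d"  // prints 10
```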
Now, if you look at your algorithm carefully, you can tease out the parallelizable part. What your algorithm does:

1. Recursively compute all permutations of the tail, xs.Tail.
2. For every tail permutation, insert the head xs.Head at each possible position (insertAtEach).
3. Concatenate the results of step 2 into one list.

When you put it like this, it's easy to see that step 2 doesn't really depend on anything but its own input. Every tail permutation is processed completely separately from all the others.
Of course, this isn't evident just from looking at your source code, because you've combined steps 2 and 3 in one expression, (insertAtEach xs.Head ps)@acc, but it's easy to tease them apart using the following general identity:
xs |> fold (fun a x -> g a (f x))
===
xs |> map f |> fold (fun a x -> g a x)
That is, instead of applying the function f to every element x during the fold, you can apply it to every element "in advance" using map.
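The identity is easy to check on a small concrete case (a sketch with arbitrarily chosen f and g):

```fsharp
// f squares an element; g folds it into a running sum.
let f x = x * x
let g a x = a + x

let xs = [1; 2; 3]

// Left side: the fold does both jobs at once.
let lhs = xs |> Seq.fold (fun a x -> g a (f x)) 0
// Right side: map applies f "in advance", then the fold only combines.
let rhs = xs |> Seq.map f |> Seq.fold g 0

printfn "%d = %d" lhs rhs  // prints 14 = 14
```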
Applying this idea to your algorithm, you get:
let rec permute (xs:'t list) =
    if xs.Length = 1 then [xs]
    else permute xs.Tail
         |> Seq.map (insertAtEach xs.Head)
         |> Seq.fold (fun acc ps -> ps@acc) []
Now it's easy to see that the Seq.map step is the parallelizable one: map applies the function to each element independently of the other elements, so it can, in principle, work in parallel. Just replace Seq with PSeq and you've got yourself the parallel version:
let rec permute (xs:'t list) =
    if xs.Length = 1 then [xs]
    else permute xs.Tail
         |> PSeq.map (insertAtEach xs.Head)
         |> PSeq.fold (fun acc ps -> ps@acc) []
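As for withDegreeOfParallelism: it only makes sense on the genuinely parallel stage, i.e. the PSeq.map, where it caps how many threads PLINQ may use. A sketch (assuming the FSharp.Collections.ParallelSeq package is referenced; permuteBounded is a hypothetical name, and using Environment.ProcessorCount as the cap is just an example; the helpers are repeated so the snippet stands alone):

```fsharp
open System
open FSharp.Collections.ParallelSeq

// Helpers from the question, repeated so this snippet is self-contained.
let put a q xs =
    ([], xs)
    ||> Seq.fold (fun acc x ->
        if x = q then a::x::acc
        else x::acc)
    |> List.rev

let insertAtEach a xs =
    ([a::xs], xs)
    ||> Seq.fold (fun acc x ->
        (put a x xs)::acc)

// Parallel version with an explicit cap on the degree of parallelism.
let rec permuteBounded (xs: 't list) =
    if xs.Length = 1 then [xs]
    else permuteBounded xs.Tail
         |> PSeq.withDegreeOfParallelism Environment.ProcessorCount
         |> PSeq.map (insertAtEach xs.Head)
         |> PSeq.fold (fun acc ps -> ps@acc) []

permuteBounded [1; 2; 3] |> List.length |> printfn "%d permutations"  // prints 6 permutations
```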
Whether or not PSeq actually does perform map in parallel is another matter, but that should be easily verifiable empirically.
And indeed, on my machine, the parallel version consistently outperforms the sequential one for lists of 7 to 10 elements (an 11-element list caused an out-of-memory error).
P.S. Keep in mind that, when measuring time, you need to force the resulting sequence first (e.g. by converting it to a list or taking Seq.last). Otherwise all you're measuring is just the parallelization overhead cost.
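One way to do that forcing (a sketch; the time helper is a hypothetical name, and the squaring pipeline is just a lazy stand-in for the permutation workload):

```fsharp
open System.Diagnostics

// Time a thunk, forcing its lazy result to a concrete list first;
// without the Seq.toList you would mostly time pipeline construction.
let time label (make: unit -> seq<'a>) =
    let sw = Stopwatch.StartNew()
    let forced = make () |> Seq.toList   // the force happens here
    sw.Stop()
    printfn "%s: %d items in %d ms" label forced.Length sw.ElapsedMilliseconds
    forced

let items = time "lazy map" (fun () -> Seq.map (fun x -> x * x) (seq { 1 .. 1000 }))
```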