std::replace
implementation can be optimized using vectorization (by specializing the library implementation or by the compiler).
The vectorized implementation would compare and replace several elements at once.
To make sure old values preserved where needed, the vectorized implementation has to do either of following:
The first approach looks the best, but does not work if the vector instruction set is lacking masked store with given element size. On x86, the SSE family does not have masked stores1, AVX/AVX2 has only 32 and 64 bit element masked stores, and only if some AVX512 subsets are available, every vector element size masked store is available.
The second approach assumes that the element can be written, if the same value is written. Which is the assumption I question. The writability of memory is not always granted. In C++, one can pass constant array with casted away const
modifier to std::replace
, and expect the correct no-operation if nothing is to replaced. With OS memory management function, the array can be made partially-writable, partially-read-only.
The third approach doesn't need masked store, doesn't make writability assumption. But it also doesn't fully benefit from vectorization.
I observed different approaches being used by different implementations. Both x86-64 icc 2021.10.0
compiler and x86-64 clang 17.0.1
attempt to vectorize std::replace
with 16-bit elements with AVX2 enabled, but Intel uses write old values approach, whereas clang uses individual stores approach. Used the following code and -mavx2
option.
#include <algorithm>
void f(short* i, short* e)
{
std::replace(i, e, 0, 1);
}
See https://godbolt.org/z/oarPhPhTs.
So it looks like as either x86-64 icc 2021.10.0
is broken or x86-64 clang 17.0.1
is missing further optimization opportunity.
Besides vectorization itself, redundant writes may be used to optimize std::replace
for small arrays or for vectorization tail handling: branchless code may want to compute the destination value (e.g. using cmovcc
on x86) and store it unconditionally,
So my question is - are compilers and standard library implementations allowed to make redundant stores to optimize std::replace
?
This question is different from near-duplicate Crash with icc: can the compiler invent writes where none existed in the abstract machine? in that it is about the standard algorithm std::replace
, not a particular its implementation or hand-written equivalent.
1 Non-temporal masked stores don't count, as they are not useful for general-purpose optimizations
My understanding is that redundant writes are not allowed.
[algorithms.requirements]/3 says:
For purposes of determining the existence of data races, algorithms shall not modify objects referenced through an iterator argument unless the specification requires such modification.
I can imagine a case where a redundant write by std::replace
modification introduces data races.
Consider some array with some of elements of x
values, and some other values. Let's assume it is divided into subranges by x
values, and the x
values themselves are not included into that subranges. Now if we run a non-modifying algorithm on each subrange in one thread, and std::replace
in another thread, to replace x
with y
, we should not have a data race. With redundant writes we will have.