The following code clamps each cell in a nalgebra square DMatrix
. Is there a way to do this in a more "vectorized" fashion, say, using SIMD instructions?
for r in 0..m.shape().0 {
for c in 0..m.shape().1 {
m[(r, c)] = nalgebra::clamp(m[(r, c)], -1.0, 1.0);
}
}
I am not interested in parallelization.
The matrix is f64
type, it can have size up to 250 x 250. It is a DMatrix
, hence its storage type VecStorage
.
In case the answer is "no, this is the standard way of doing things in nalgebra" then I'll accept that with enough evidence. I just want to make sure that I'm not doing something wrong. Coming from a numpy / Python background, vectorized form of these type of bulk operations is preferred/more performant than iterating over arrays.
I expected this code to not auto-vectorize, but to my surprise, it did!
Still, there is a faster version (around 2x). Actually, two of them:
#[inline]
pub fn naive(m: &mut DMatrix<f32>) {
for r in 0..m.shape().0 {
for c in 0..m.shape().1 {
m[(r, c)] = m[(r, c)].clamp(-1.0, 1.0);
}
}
}
#[inline]
pub fn apply(m: &mut DMatrix<f32>) {
m.apply(|v| *v = v.clamp(-1.0, 1.0));
}
#[inline]
pub fn as_slice(m: &mut DMatrix<f32>) {
for v in m.as_mut_slice() {
*v = v.clamp(-1.0, 1.0);
}
}
apply()
and as_mut_slice()
have the same performance (even though I expected the latter to vectorize better, and thus be faster), so you should probably use apply()
, as it is more generic and supports e.g. views (it will probably be slower with them, though, because it won't vectorize).
nalgebra
's clamp()
generates worse assembly than f32/64::clamp()
, so I used the latter. It indeed performs slightly better in the benchhmark.
Benchmark for a matrix of 1000x1000, on my computer that has AVX2:
naive time: [1.5351 ms 1.5599 ms 1.5845 ms]
apply time: [715.58 µs 724.89 µs 735.01 µs]
as_slice time: [712.68 µs 721.35 µs 731.02 µs]
naive_nalgebra_clamp time: [1.5515 ms 1.5681 ms 1.5877 ms]
apply_nalgebra_clamp time: [756.30 µs 763.27 µs 771.27 µs]
as_slice_nalgebra_clamp time: [735.29 µs 743.23 µs 752.35 µs]
Remember to set -C target-feature=<highest_target_feature_you_can_allow>
for best performance.