rustnalgebra

Clamp matrix cells in nalgebra


The following code clamps each cell in a nalgebra square DMatrix. Is there a way to do this in a more "vectorized" fashion, say, using SIMD instructions?

        for r in 0..m.shape().0 {
            for c in 0..m.shape().1 {
                m[(r, c)] = nalgebra::clamp(m[(r, c)], -1.0, 1.0);
            }
        }

I am not interested in parallelization. The matrix is f64 type, it can have size up to 250 x 250. It is a DMatrix, hence its storage type VecStorage.

In case the answer is "no, this is the standard way of doing things in nalgebra" then I'll accept that with enough evidence. I just want to make sure that I'm not doing something wrong. Coming from a numpy / Python background, vectorized form of these type of bulk operations is preferred/more performant than iterating over arrays.


Solution

  • I expected this code to not auto-vectorize, but to my surprise, it did!

    Still, there is a faster version (around 2x). Actually, two of them:

    #[inline]
    pub fn naive(m: &mut DMatrix<f32>) {
        for r in 0..m.shape().0 {
            for c in 0..m.shape().1 {
                m[(r, c)] = m[(r, c)].clamp(-1.0, 1.0);
            }
        }
    }
    
    #[inline]
    pub fn apply(m: &mut DMatrix<f32>) {
        m.apply(|v| *v = v.clamp(-1.0, 1.0));
    }
    
    #[inline]
    pub fn as_slice(m: &mut DMatrix<f32>) {
        for v in m.as_mut_slice() {
            *v = v.clamp(-1.0, 1.0);
        }
    }
    

    apply() and as_mut_slice() have the same performance (even though I expected the latter to vectorize better, and thus be faster), so you should probably use apply(), as it is more generic and supports e.g. views (it will probably be slower with them, though, because it won't vectorize).

    nalgebra's clamp() generates worse assembly than f32/64::clamp(), so I used the latter. It indeed performs slightly better in the benchhmark.

    Benchmark for a matrix of 1000x1000, on my computer that has AVX2:

    naive                   time:   [1.5351 ms 1.5599 ms 1.5845 ms]
    
    apply                   time:   [715.58 µs 724.89 µs 735.01 µs]
    
    as_slice                time:   [712.68 µs 721.35 µs 731.02 µs]
    
    naive_nalgebra_clamp    time:   [1.5515 ms 1.5681 ms 1.5877 ms]
    
    apply_nalgebra_clamp    time:   [756.30 µs 763.27 µs 771.27 µs]
    
    as_slice_nalgebra_clamp time:   [735.29 µs 743.23 µs 752.35 µs]
    

    Remember to set -C target-feature=<highest_target_feature_you_can_allow> for best performance.