I wrote some unoptimized C source code for matrix multiplication, and I want to test the optimization capabilities of the Clang compiler.
void MatrixMul(unsigned int N, int *C, int *A, int *B) {
    unsigned int i, j, k;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            C[i * N + j] = 0;
            for (k = 0; k < N; k++) {
                C[i * N + j] += A[i * N + k] * B[k * N + j];
            }
        }
    }
}
The assembly code for the innermost loop might look like this (I made a few changes):
.LBB0_4:                     # Parent Loop BB0_2 Depth=1
                             # Parent Loop BB0_3 Depth=2
                             # => This Inner Loop Header: Depth=3
    slli t0, a4, 32
    slli t2, a5, 32
    srli t0, t0, 30          # zero-extend id1, scale by 4
    srli t2, t2, 30          # zero-extend id2, scale by 4
    add  t0, t0, a2          # &A[id1]
    add  t2, t2, a3          # &B[id2]
    lw   t0, 0(t0)           # A[id1]
    lw   t2, 0(t2)           # B[id2]
    addi a4, a4, 1           # id1++
    mul/mulw t0, t2, t0      # A[id1] * B[id2]
    add  a5, a5, a0          # id2 += N
    add  t5, t5, t0          # accumulate
    sw   t5, 0(t4)
    bne  a4, t6, .LBB0_4     # loop until id1 == endIdx
However, when I change mul to mulw, I find that the code runs faster than before, and I don't understand why. According to the RISC-V manual, mul multiplies rs1 and rs2 and writes the lower XLEN (here 64) bits of the product, while mulw multiplies the lower 32 bits of rs1 and rs2 and sign-extends the 32-bit result to 64 bits.
Is this phenomenon platform-specific? Or is mulw inherently easier to optimize, and is this behavior consistent across all chips?
mulw is a 32-bit operation and mul is a 64-bit operation; since you're working with 32-bit integers (int), the upper 32 bits of the product are not needed anyway. Whether the narrower multiply is actually faster is microarchitecture-dependent: on cores whose multiplier latency depends on operand width, mulw can complete sooner, while on cores with a fixed-latency multiplier the two instructions take the same time.