I wrote some unoptimized C source code for matrix multiplication, and I want to test the optimization capabilities of the Clang compiler.
void MatrixMul(unsigned int N, int *C, int *A, int *B) {
    unsigned int i, j, k;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            C[i * N + j] = 0;
            for (k = 0; k < N; k++) {
                C[i * N + j] += A[i * N + k] * B[k * N + j];
            }
        }
    }
}
The assembly code for the innermost loop might look like this (I made a few changes):
.LBB0_4:                     # Parent Loop BB0_2 Depth=1
                             # Parent Loop BB0_3 Depth=2
                             # => This Inner Loop Header: Depth=3
    slli t0, a4, 32
    slli t2, a5, 32
    srli t0, t0, 30          # zero-extend id1, scale by 4
    srli t2, t2, 30          # zero-extend id2, scale by 4
    add  t0, t0, a2          # &A[id1]
    add  t2, t2, a3          # &B[id2]
    lw   t0, 0(t0)           # A[id1]
    lw   t2, 0(t2)           # B[id2]
    addi a4, a4, 1           # id1++
    mul/mulw t0, t2, t0      # A[id1] * B[id2]
    add  a5, a5, a0          # id2 += N
    add  t5, t5, t0          # accumulate
    sw   t5, 0(t4)
    bne  a4, t6, .LBB0_4     # loop until id1 == endIdx
However, when I change mul to mulw, I find that the code runs faster than before, and I don't understand why. According to the RISC-V manual, mul multiplies rs1 and rs2 and writes the lower XLEN (here 64) bits of the product, while mulw multiplies the lower 32 bits of rs1 and rs2 and sign-extends the 32-bit result to 64 bits.
Is this phenomenon platform-specific? Or is mulw inherently easier to optimize, and is this behavior consistent across all chips?
mulw is a 32-bit operation and mul is a 64-bit operation; since you're working with 32-bit integers (int), the upper 32 bits of the product are not needed anyway. Whether the narrower multiply is actually faster is microarchitecture-dependent: on cores whose multiplier latency depends on operand width, mulw can complete sooner, while on cores with a fixed-latency multiplier the two instructions take the same time.