I compared the performance of 3x3 and 4x4 matrix multiplication using Eigen with the -O3 optimization flag, and surprisingly, I found that the 4x4 case is more than twice as fast as the 3x3 case! This result was unexpected, and I'm curious to understand why.
The benchmark code I used was compiled with the following command:
g++ -O3 -I/usr/include/eigen3 tmp.cpp
Here’s the code snippet for the benchmark:
#include <Eigen/Dense>
#include <Eigen/Geometry>
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <valgrind/callgrind.h>
void benchmark_matrix3d_multiplication(const std::vector<Eigen::Matrix3d>& matrices) {
    auto start = std::chrono::high_resolution_clock::now();
    Eigen::Matrix3d result = Eigen::Matrix3d::Identity();
    for(size_t i = 0; i < matrices.size(); i++) {
        asm("# 3dmat mult start");
        result = result * matrices[i];
        asm("# 3dmat mult end");
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "Matrix3d: " << duration.count() << " microseconds" << std::endl;
    std::cout << result << std::endl; // to avoid compiler optimization
}

void benchmark_matrix4d_multiplication(const std::vector<Eigen::Matrix4d>& matrices4) {
    auto start = std::chrono::high_resolution_clock::now();
    Eigen::Matrix4d result = Eigen::Matrix4d::Identity();
    for(size_t i = 0; i < matrices4.size(); i++) {
        asm("# 4dmat mult start");
        result = result * matrices4[i];
        asm("# 4dmat mult end");
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "Matrix4d: " << duration.count() << " microseconds" << std::endl;
    std::cout << result << std::endl; // to avoid compiler optimization
}

void benchmark_matrix_multiplication() {
    const int num_matrices = 100000;

    std::vector<Eigen::Matrix3d> matrices(num_matrices);
    for(size_t i = 0; i < num_matrices; i++) {
        auto q = Eigen::Quaterniond::UnitRandom();
        matrices[i] = q.toRotationMatrix();
    }

    std::vector<Eigen::Matrix4d> matrices4(num_matrices);
    for(size_t i = 0; i < num_matrices; i++) {
        matrices4[i].block<3,3>(0,0) = matrices[i];
    }

    CALLGRIND_START_INSTRUMENTATION; // start callgrind
    benchmark_matrix3d_multiplication(matrices);
    benchmark_matrix4d_multiplication(matrices4);
}

int main() {
    benchmark_matrix_multiplication();
}
The result is the following and confirms that the two results match each other:
Matrix3d: 2008 microseconds
0.440386 -0.897765 -0.00889435
0.808307 0.400777 -0.431298
0.390768 0.182748 0.902166
Matrix4d: 833 microseconds
0.440386 -0.897765 -0.00889435 0
0.808307 0.400777 -0.431298 0
0.390768 0.182748 0.902166 0
0 0 0 0
Note that I added asm("# 3dmat mult start") and similar markers to identify the relevant sections of the assembly code. Based on the assembly output, it appears that the multiplication operations are indeed being performed and not optimized away by the compiler. Interestingly, the assembly code for the 4x4 case contains more instructions than the 3x3 case, yet it executes faster.
Could someone explain why the 4x4 matrix multiplication is significantly faster in this scenario? Is there a mistake in my measurement, or is there an underlying reason for this result?
lscpu result of my machine https://gist.github.com/HiroIshida/7230aab2d40a6cae7b18c552a363f31f
g++ version is: g++ (Ubuntu 13.1.0-8ubuntu1~20.04.2) 13.1.0
asm file https://gist.github.com/HiroIshida/284f3a006b96a6ee9b1004ead96095dd
corresponding asm code for 3x3 multiplication extracted from the above https://gist.github.com/HiroIshida/5a3bf192b58467a0fba823b9489c0421
corresponding asm code for 4x4 multiplication extracted from the above https://gist.github.com/HiroIshida/9176c7a91e30f7e426e927412740f615
I didn't dig deeper, but on my ARM machine, as expected, the 3x3 matrix multiplication is faster than the 4x4 one.
TL;DR: there are several issues happening simultaneously, but the main one is that the 3D version has a higher latency (due to scheduling and register-file issues) that cannot be overlapped with computation because of the loop-carried dependency. Changing the order of the operations helps the CPU hide the latency of the 3D version so it can be faster than the 4D one. Other issues impacting the stability of the benchmark are frequency scaling (and, more generally, power management) and memory alignment.
I can reproduce this on an AMD Ryzen 7 5700U CPU (Zen2) with GCC 13.2.0 on Ubuntu 24.04. The performance results are very unstable for one run (since the benchmark is too short). That being said, when running the program many times (e.g., 1000 times), interesting things start to happen and the weird effect surprisingly appears!
First of all, here are the results on my machine for 100 runs without any modification of the program:
Note that the vertical axis is the reported time (in µs) and the horizontal axis is the number of iterations.
We can see that the 3D version is less stable than the 4D version, but the noise is so large that we cannot conclude much more (except perhaps that there is a kind of strange threshold in the timings of the 3D version).
If we increase the number of runs to 1000, then we get surprising results:
OK, now the results are far more stable (less random), more deterministic (similar results between many groups of runs) and also more reliable (because the benchmark runs for dozens of seconds).
We can now say that the 3D version is indeed slower on average and less stable than the 4D version. Interestingly, the 4D version surprisingly becomes faster after a given time, contrary to the 3D version.
Note that using -march=native makes computations faster (thanks to AVX/AVX2), but it has no impact on the issues overall. Also, please note that -march=native and -mavx2 -mfma give the same performance results.
Note that a similar effect can be seen with vectors 20 times bigger. The effect is also independent of the thread binding, so this is certainly not due to context switches or NUMA effects. Changing the order of the function calls did not result in any significant performance impact. The scaling governor is set to performance on all cores (though the cores may still run at a variable frequency). Could it be due to frequency scaling?
Well, it seems frequency scaling somehow plays a role in the initial threshold for the 4D version. I set the minimum and maximum frequency to the same value (800 MHz, because of a bug in the scaling governor on my machine that ignores the minimum frequency I actually set, so the minimum frequency for active cores always ends up at 800 MHz). I checked that the frequency was stable during the run (at least at a coarse granularity). Here is the result:
We can see that there is no threshold anymore for the 4D version! More generally, I observed two effects when trying different frequencies:
Regarding the last two points, this is actually because the threshold delay seems to be fixed and independent of the frequency (about 2.5 seconds), so a lower frequency makes each iteration slower, causing a smaller delay. All of this proves that frequency impacts the threshold, though I cannot explain why. My guess is that this is related to the power management of AMD Zen CPUs. I expect this constant to be either set in the kernel (possibly in vendor-dependent files) or to be a hardware/firmware constant based on the off-core frequency (independent of the core frequency).
I also found out (later) that the effect does not happen when the laptop is charging, though the performance is then a bit less stable! This confirms that the issue is directly related to the power management of the CPU.
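For reference, here is a minimal sketch of one way to sample the current core frequency from Linux's cpufreq sysfs interface during a run (this is only an illustration; it is not necessarily how the frequency was checked for the plots above):

#include <fstream>
#include <string>

// Read the current frequency of a given core (in kHz) from the Linux cpufreq
// sysfs interface; returns -1 if the file cannot be read.
long read_cur_freq_khz(int cpu = 0) {
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cpufreq/scaling_cur_freq");
    long khz = -1;
    f >> khz;
    return khz;
}

Calling this before, after, and periodically during the benchmark gives a coarse idea of whether the core stayed at the pinned frequency.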
When the execution time of a function doing basic numerical computation is quantized (and the effect is not due to frequency), the quantization is often caused by memory-alignment issues. This is what happens for the 3D version, typically because a double-precision 3x3 matrix takes 3*3*8 = 72 bytes (as pointed out by Raildex in the comments). In practice, compilers like GCC can add some padding at the end of the structure so that memory accesses are typically faster (resulting in 80-byte 3x3 matrices). Note that this is not a power of two. Moreover, a 3x3 matrix typically crosses two cache lines, causing additional loads from the L1 cache during unaligned accesses. These additional loads result in a higher latency. Even worse: the latency overhead is not constant, so I think this might make it harder for the CPU instruction scheduler to schedule the target instructions efficiently. To check the memory-alignment hypothesis, we can simply align each matrix on a cache line. This is not really efficient in terms of memory space (so it can decrease the performance of memory-bound codes), but it should at least fix this alignment issue. Here is the code to force matrices to be aligned in memory:
// Align each matrix to 64 bytes (i.e., a cache line)
class alignas(64) Matrix3d : public Eigen::Matrix3d
{
public:
    Matrix3d() = default;
    Matrix3d(const Matrix3d&) = default;
    Matrix3d(Matrix3d&&) = default;
    template<typename MatrixType>
    Matrix3d(const MatrixType& other) : Eigen::Matrix3d(other) {}
    template<typename MatrixType>
    Matrix3d& operator=(const MatrixType& other) { Eigen::Matrix3d::operator=(other); return *this; }
};

class alignas(64) Matrix4d : public Eigen::Matrix4d
{
public:
    Matrix4d() = default;
    Matrix4d(const Matrix4d&) = default;
    Matrix4d(Matrix4d&&) = default;
    template<typename MatrixType>
    Matrix4d(const MatrixType& other) : Eigen::Matrix4d(other) {}
    template<typename MatrixType>
    Matrix4d& operator=(const MatrixType& other) { Eigen::Matrix4d::operator=(other); return *this; }
};

// Replace all occurrences of Eigen::Matrix by Matrix in the rest of the code
// [...]
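As a quick sanity check (my own addition, assuming the Matrix3d wrapper above and a C++17 compiler so that std::vector honours the over-alignment), we can verify that the wrappers really are cache-line aligned:

#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

// alignas(64) bumps the alignment to a cache line and rounds the size up to
// a multiple of 64 bytes (the 72-byte Eigen::Matrix3d becomes 128 bytes).
static_assert(alignof(Matrix3d) == 64, "wrapper should be cache-line aligned");
static_assert(sizeof(Matrix3d) % 64 == 0, "size is rounded up to the alignment");

void check_alignment() {
    std::vector<Matrix3d> v(8); // C++17 operator new honours over-aligned types
    for (const auto& m : v)
        assert(reinterpret_cast<std::uintptr_t>(&m) % 64 == 0);
    std::cout << "sizeof(Matrix3d) = " << sizeof(Matrix3d)
              << ", alignof(Matrix3d) = " << alignof(Matrix3d) << std::endl;
}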
Here are the performance results:
We can see that the timings of the 3D version are not quantized anymore. Note that the timing is unfortunately the same as the upper threshold in the previous benchmarks. One can also note that the 4D version is a bit slower (possibly due to less efficient generated code). Strangely, the timings of the 4D version are less stable. I expect the alignment of the array allocated by std::vector to impact such timings.
Once combined with the fixed (low) frequency, we get the following performance results:
Results are now stable, but there is still one big issue: why is the 4D version still faster than the 3D one?
Your functions have a special property: each iteration of the hot loop is dependent on the previous one. As a result, it is much more difficult for the CPU to hide the latency of the instructions. It is also difficult to find enough instructions to execute in parallel. Because of that, the performance of the code should be very sensitive to the latency of the instructions and also any overhead impacting it. I think this is why there are so many strange effects on this simple code and also why the results are so unstable.
The scheduling of the instructions can have a strong impact on performance. This is especially true on Zen2 CPUs, which AFAIK use multiple separate FP schedulers (not a unified scheduler). Moreover, some instructions can only be scheduled on specific ports. Thus, if the scheduler makes a bad decision by not queuing instructions to the best queue for a given iteration, the latency can slightly increase (probably by a few cycles), so the overall execution will be slower. A few cycles seems small, but a few cycles per iteration is a lot when each iteration lasts only dozens of cycles. Here, in the last 800 MHz benchmark, we can see that the 4D version takes about 4800 µs to execute 100_000 iterations. This means ~38 cycles/iteration (while there are about 140-150 instructions per iteration to execute)!
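As a quick sanity check of that number (simple arithmetic, assuming the fixed 800 MHz frequency from the earlier experiment): 4800 µs × 800 MHz = 3.84×10^6 cycles, and 3.84×10^6 cycles / 100_000 iterations ≈ 38 cycles per iteration.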
In practice, hardware performance counters tend to indicate poor back-end resource usage (on a modified example meant to measure each version separately):
Use of the 4 FP SIMD ports (i.e., execution units):
- 3D version:
467 703 544 fpu_pipe_assignment.total0
940 583 576 fpu_pipe_assignment.total1
22 580 911 fpu_pipe_assignment.total2
10 002 224 fpu_pipe_assignment.total3
- 4D version:
334 752 204 fpu_pipe_assignment.total0
280 487 188 fpu_pipe_assignment.total1
269 800 454 fpu_pipe_assignment.total2
268 277 318 fpu_pipe_assignment.total3
In the 3D version, two of the four ports are barely used at all, and of the two remaining ports, one is twice as saturated as the other! This is really bad. Meanwhile, the 4D version spreads the work evenly across all four ports. This tends to confirm the instruction-scheduling issues.
The hardware performance counters also indicate the reason for dispatched instruction stalls. Here are the two main sources of stalls in the two versions:
Source of stalls (in cycles):
- 3D version:
7 936 000 400 cycles
1 609 456 587 de_dis_dispatch_token_stalls1.fp_sch_rsrc_stall
3 998 358 801 de_dis_dispatch_token_stalls1.fp_reg_file_rsrc_stall
- 4D version:
6 950 481 062 cycles
194 523 072 de_dis_dispatch_token_stalls1.fp_sch_rsrc_stall
2 858 781 759 de_dis_dispatch_token_stalls1.fp_reg_file_rsrc_stall
fp_sch_rsrc_stall counts "cycles where a dispatch group is valid but does not get dispatched due to a token stall. FP scheduler resource stall.", while fp_reg_file_rsrc_stall is similar but for a "floating point register file resource stall.". Thus, in the 3D version, while some stalls (20%) happen due to scheduling issues, the main source of stalls (50%) seems to come from the FP register file. Generally, this is due to missing entries in the FP register file, but that would be weird here since Zen2 is supposed to have 160 FP registers (which seems sufficient for this code). I wonder if it could somehow be due to speculative execution of FP instructions. In the 4D version, there are also a lot of FP register stalls (41%) but nearly no stalls due to scheduling. While I do not fully understand why there are so many FP register file stalls, the latter point confirms the above hypothesis.
In practice you should really avoid writing such code for several reasons. First of all, it is certainly not numerically stable (because of the large number of matrix multiplications performed iteratively).
Moreover, the dependency between iterations typically decreases performance. A key to speeding up this computation is to use a pair-wise reduction (i.e., something like (A*A)*(A*A)). This helps the CPU compute multiple matrix multiplications concurrently (possibly even in parallel), reducing scheduling stalls. Besides, a pair-wise reduction is much more numerically stable. This improvement makes the 3D version faster than the 4D version! Here are the performance results when matrices are multiplied pair by pair (for both versions):
We can see that the 3D version is now faster and the 4D version is not affected much by this improvement. Note that computing matrices by groups of 4 decreases performance since the generated code seems to be less efficient for my CPU (due to instruction scheduling and too many required registers).
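For clarity, here is a minimal sketch of what multiplying the matrices pair by pair can look like (my own illustration reusing the includes from the question; the code used for the measurements above may differ slightly):

Eigen::Matrix3d multiply_pairwise(const std::vector<Eigen::Matrix3d>& matrices) {
    Eigen::Matrix3d result = Eigen::Matrix3d::Identity();
    size_t i = 0;
    for (; i + 1 < matrices.size(); i += 2) {
        // This product does not depend on `result`, so the CPU can start it
        // while the previous `result * ...` multiplication is still in flight.
        Eigen::Matrix3d pair = matrices[i] * matrices[i + 1];
        result = result * pair;
    }
    if (i < matrices.size()) // odd count: fold in the last matrix
        result = result * matrices[i];
    return result;
}

Since matrix multiplication is associative, the result is the same as with the original loop (up to rounding), but the loop-carried dependency chain is only half as long.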
FMA instructions (i.e., instructions fusing a multiplication with an addition) can increase the accuracy of the computation while reducing the latency of the code. You can pass -mfma to the compiler to generate FMA instructions. However, as stated previously, it does not impact the difference between the 3D and 4D versions.
Adding padding to the 3D matrix can help perform fewer (aligned) loads/stores, avoiding some latency overheads (AFAIK, storing data to a cache line can impact the speed of reads at different locations on the same cache line on some CPUs). Compilers tend not to do that because they need to prove the access is safe and that the operations have no side effects on non-masked values (not to mention that masking can be expensive on some platforms).
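To illustrate the padding idea, a simple user-level trick (my own sketch, and essentially what the question's 4D version already does, but with a 1 in the bottom-right corner so the product stays a valid homogeneous transform) is to embed each 3x3 rotation in a padded 4x4 matrix:

// Embed a 3x3 rotation into a padded 4x4 homogeneous matrix: the top-left
// 3x3 block of the product of the 4x4 matrices equals the product of the
// original 3x3 matrices.
Eigen::Matrix4d embed(const Eigen::Matrix3d& r) {
    Eigen::Matrix4d m = Eigen::Matrix4d::Identity();
    m.topLeftCorner<3, 3>() = r;
    return m;
}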
For basic 3x3 and 4x4 matrices, the best solution might be to manually write assembly code for the target architecture despite the many downsides (not simple, tedious to write, not really maintainable, not portable).