c++ · optimization · parallel-processing · dot-product

Optimizing a Frequently Called Dot Product Function in C++


I have a function in C++ that calculates the delta energy for a given lattice and interaction matrix. The function works even for non-symmetric interaction matrices. Here is the code:

#include <cstddef>
#include <vector>

// dE for flipping spin k: -2 * S_k * sum_i (J[i][k] + J[k][i]) * S_i, excluding the i == k term.
double delta_energy(const std::vector<int>& lattice,
                    const std::vector<std::vector<double>>& J, int k) {
    int S_k = lattice[k];
    double interaction_sum = 0;
    for(std::size_t i = 0; i < lattice.size(); ++i) {
        interaction_sum += J[i][k] * lattice[i] + J[k][i] * lattice[i];
    }
    interaction_sum -= 2 * J[k][k] * lattice[k];  // remove the double-counted diagonal term
    double delta_E = -2 * S_k * interaction_sum;
    return delta_E;
}

The code is fairly simple and basically calculates a dot product using a for loop. I've profiled my code: this function is called many times and has become the bottleneck. I tried to optimize it with OpenMP for parallel computing, but that made it even slower, even for large matrices, which doesn't make sense to me. I'm looking for alternative ways to optimize this function, or at least to get a sense of how to make it faster.


Solution

  • The main bottleneck of this code should be the J[i][k] access, by a large margin. Since the loop iterates over i, these memory accesses are non-contiguous and the stride between them is huge. Even worse, the stride is not even constant because of the vector<vector<double>> layout, which is inefficient, as mentioned by @NathanOliver in the comments. Flattening the matrix (see the snippet below) reduces the memory-access overhead so the CPU can prefetch data better, but the code remains very inefficient due to the large stride. Because of that, the function should clearly be memory-bound, so the key is to optimize memory accesses.
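
    For illustration, a flattened row-major layout keeps the whole matrix in one contiguous buffer. This is only a sketch; J_flat, n, i and k are placeholder names, not part of the original code:

    // Hypothetical flattened storage: one contiguous n x n buffer instead of vector<vector<double>>.
    // Element (i, k) lives at index i * n + k, so each row is contiguous in memory.
    std::vector<double> J_flat(n * n);
    double v = J_flat[i * n + k];   // contiguous when k varies, strided by n when i varies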

    To understand how bad the situation is, one needs to know that the CPU loads a chunk of typically 64 bytes whenever a single item is fetched from main memory. This chunk is called a cache line. On top of that, if the contiguous dimension of J is a power of two (for example one comparable to the size of a CPU cache level), the strided accesses will cause a lot of conflict misses in the CPU cache. Thus, the cache will be poorly used and the latency of each fetch will be high. Due to the large-strided accesses, the CPU needs to reload the same cache lines from main memory up to 8 times (a 64-byte line holds 8 doubles, and only one of them is used per strided access), assuming J is too large to fit in the cache. That is likely the case here, since lattice.size() is said to be "in the order of thousands, or tens of thousands" and the matrix is expected to be square due to the transposed access. There are related posts showing how such conflict misses can cause a huge cache-thrashing overhead.

    One solution is to follow Data-Oriented Design, that is, to compute multiple k items in delta_energy and return multiple results (for example by storing the resulting values in a small fixed-size array). This enables cache lines to be reused several times before they are evicted; a sketch is given below. It is similar to the approach proposed by @ThomasMatthews in the comments. It is not clear how you call delta_energy, but if you call it in a loop with independent iterations, then it is better to compute the two loops in the same function and use tiling as proposed. In fact, one can also use a Lebesgue (Z-order) curve to speed things up with a cache-oblivious algorithm (though this is not trivial to implement).
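
    Here is a minimal sketch of that idea, assuming a flattened row-major matrix of size n * n and a caller that needs the delta energies of several consecutive sites at once. The function name, J_flat, B and out are placeholders, not part of your code:

    #include <vector>

    // Compute delta_E for B consecutive sites k0 .. k0+B-1 in a single pass over the
    // matrix, so each cache line of the strided "J[i][k]"-like accesses is reused B times.
    template <int B>
    void delta_energy_block(const std::vector<int>& lattice,
                            const std::vector<double>& J_flat, int n,
                            int k0, double (&out)[B]) {
        double sums[B] = {};
        for (int i = 0; i < n; ++i) {
            const double s_i = lattice[i];
            for (int b = 0; b < B; ++b) {
                const int k = k0 + b;
                // J_flat[i * n + k] covers B contiguous doubles for b = 0..B-1:
                // one fetched cache line now serves several sites instead of one.
                sums[b] += (J_flat[i * n + k] + J_flat[k * n + i]) * s_i;
            }
        }
        for (int b = 0; b < B; ++b) {
            const int k = k0 + b;
            sums[b] -= 2.0 * J_flat[k * n + k] * lattice[k];   // remove the i == k term
            out[b] = -2.0 * lattice[k] * sums[b];
        }
    }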

    An alternative solution is to pre-compute the transpose of J so that the J[i][k] accesses become contiguous. This solution is sub-optimal compared to the previous one, as it requires more data to be read from memory and consumes more memory too, but it is certainly simpler to implement. The transposition can be done efficiently with a BLAS library or any optimized matrix library (e.g. Eigen). You can then write:

    // JT is the transposed contiguous matrix of J
    interaction_sum += JT[k][i] * lattice[i] + J[k][i] * lattice[i];
    
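    One simple way to build such a transposed copy is a plain double loop, here as a flat row-major buffer (the function name is a placeholder; a BLAS/Eigen transpose would do the same job, likely faster):

    #include <cstddef>
    #include <vector>

    // JT[k * n + i] == J[i][k]; built once, outside the hot loop.
    std::vector<double> build_transpose(const std::vector<std::vector<double>>& J) {
        const std::size_t n = J.size();
        std::vector<double> JT(n * n);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k)
                JT[k * n + i] = J[i][k];
        return JT;
    }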

    Now that the accesses are contiguous thanks to the transposition, the code can be auto-vectorized by optimizing compilers, assuming fast-math flags are enabled (-ffast-math on Clang/GCC). Otherwise, a compiler will not vectorize the code efficiently because floating-point math is not associative. Be aware that -ffast-math (and equivalent flags) can be dangerous though. OpenMP SIMD directives can help to vectorize the code without this possibly-dangerous flag, although they are not always handled efficiently by all compilers yet (see the example below).
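
    For example, using the same 2D notation as the snippet above, the inner loop can be annotated like this (a sketch; JT, J, lattice, n and k are assumed to be in scope, and the pragma is only honored when compiling with -fopenmp or -fopenmp-simd):

    double interaction_sum = 0.0;
    // The reduction clause tells the compiler it may reorder the additions,
    // which allows SIMD vectorization without -ffast-math.
    #pragma omp simd reduction(+:interaction_sum)
    for (int i = 0; i < n; ++i)
        interaction_sum += JT[k][i] * lattice[i] + J[k][i] * lattice[i];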

    Moreover, you can factorize the expression interaction_sum += JT[k][i] * lattice[i] + J[k][i] * lattice[i]; into interaction_sum += (JT[k][i] + J[k][i]) * lattice[i];.

    Last but not least, you can also pre-compute an array lattice_double, the double version of lattice, so as to avoid doing many int-to-double conversions in the hot loop. This roughly doubles the size of that array, which may then no longer fit in the L1 cache depending on its actual size, but the L2 cache is generally fast too and I expect the J and JT loads to be the bottleneck anyway.

    The resulting code will most likely still be memory-bound, but the amount of data loaded from RAM should be much smaller, and the contiguous memory accesses should make the code bound by memory throughput rather than by memory latency. A combined sketch is shown below.
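
    Putting these pieces together, a possible rewrite could look like the following sketch. All names are placeholders: Jrow and JT are flat row-major copies of J and of its transpose, lattice_d is the pre-converted double version of lattice, and the caller is responsible for building them once and keeping lattice_d in sync with the spins:

    #include <cstddef>
    #include <vector>

    // Jrow[k * n + i] == J[k][i], JT[k * n + i] == J[i][k], lattice_d[i] == lattice[i].
    double delta_energy_fast(const std::vector<double>& lattice_d,
                             const std::vector<double>& Jrow,
                             const std::vector<double>& JT,
                             std::size_t n, std::size_t k) {
        const double S_k = lattice_d[k];
        double interaction_sum = 0.0;
        #pragma omp simd reduction(+:interaction_sum)
        for (std::size_t i = 0; i < n; ++i)
            interaction_sum += (JT[k * n + i] + Jrow[k * n + i]) * lattice_d[i]; // contiguous, factorized
        interaction_sum -= 2.0 * Jrow[k * n + k] * S_k;   // remove the i == k term
        return -2.0 * S_k * interaction_sum;
    }

    The conversion of lattice can be done once with std::vector<double> lattice_d(lattice.begin(), lattice.end()); and updated whenever a spin changes.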