x86intelcpu-cacheinstructionsxeon-phi

What is the purpose of `_mm_clevict` intrinsic and corresponding clevict0, clevict1 instructions?


Intel® Intrinsics Guide says about _mm_clevict:

void _mm_clevict (const void * ptr, int level)
#include <immintrin.h>
Instruction: clevict0 m8
             clevict1 m8
CPUID Flags: KNCNI

Evicts the cache line containing the address ptr from cache level level (can be either 0 or 1).

What could be the purpose of this operation? Is it different from _mm_cldemote?


Solution

  • As far as I can tell, these instructions were added to the 1st generation Xeon Phi (Knights Corner, KNC) processors to help deal with some very specific performance issues for data motion through the cache hierarchy. It has been quite a while since I looked at the details, but my recollection is that there were some performance problems associated with cache victims, and that throughput was improved if the no-longer-needed lines were evicted from the caches before the cache miss that would cause an eviction.

    Idea (1): This might have been due to memory bank conflicts on dirty evictions. E.g., consider what would happen if the address mapping made it too likely that the new item being loaded would be located in a DRAM bank that conflicted with the victim to be discarded. If there were not enough write buffers at the memory controller, the writeback might have to be committed to DRAM before the DRAM could switch banks to service the read. (Newer processors have lots and lots of write buffers in the memory controller, so this is not a problem, but this could have been a problem for KNC.)

    Idea (2): Another possibility is that the cache victim processing could delay the read of the new value because of serialization at the Duplicate Tag Directories (DTDs). The coherence protocol was clearly a bit of a "hack" (so that Intel could use the existing P54C with minimal changes), but the high-level documentation Intel provided was not enough to understand the performance implications of some of the implementation details.

    The CLEVICT instructions were "local" -- only the core executing the instruction performed the eviction. Dirty cache lines would be written out and locally invalidated, but the invalidation request would not be transmitted to other cores. The instruction set architecture documentation does not comment on whether the CLEVICT instruction results in an update message from the core to the DTD. (This would be necessary for idea (2) to make any change in performance.)

    The CLDEMOTE instruction appears to be intended to reduce the latency of cache-to-cache transfers in producer-consumer situations. From the instruction description: "This may accelerate subsequent accesses to the line by other cores in the same coherence domain, especially if the line was written by the core that demotes the line." This is very similar to my patent https://patents.google.com/patent/US8099557B2/ "Push for sharing instruction" (developed while I was at AMD).