I used https://github.com/nviennot/core-to-core-latency to measure my CPU's (Intel(R) Core(TM) Ultra 7 268V) core-to-core latency and these are my results:
~/Developer/core-to-core-latency main
❯ cargo run -- --bench 1
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.08s
Running `target/debug/core-to-core-latency --bench 1`
CPU: Intel(R) Core(TM) Ultra 7 268V
Num cores: 8
Num iterations per samples: 1000
Num samples: 300
1) CAS latency on a single shared cache line
           0      1      2      3      4     5     6     7
     0
     1  103±0
     2   96±0   94±0
     3   96±0   96±0   95±0
     4  219±1  218±0  212±0  210±0
     5  177±2  158±2  185±2  159±2   60±0
     6  157±2  172±6  153±2  186±4   55±0  50±0
     7  142±1  145±0  181±2  209±0   54±0  46±0  45±0
Min latency: 45.3ns ±0.2 cores: (7,6)
Max latency: 218.9ns ±0.6 cores: (4,0)
Mean latency: 134.8ns
~/Developer/core-to-core-latency main
❯ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ
0 0 0 0 0:0:0:0 yes 4900.0000 400.0000 955.0980
1 0 0 1 4:4:1:0 yes 4900.0000 400.0000 1400.4000
2 0 0 2 8:8:2:0 yes 5000.0000 400.0000 667.1180
3 0 0 3 12:12:3:0 yes 5000.0000 400.0000 1213.8459
4 0 0 4 64:64:8 yes 3700.0000 400.0000 1100.0450
5 0 0 5 66:66:8 yes 3700.0000 400.0000 1103.3210
6 0 0 6 68:68:8 yes 3700.0000 400.0000 400.0000
7 0 0 7 70:70:8 yes 3700.0000 400.0000 400.0000
Communication between cores 4-7 (efficiency cores) is much faster than between the performance cores 0-3.
I am wondering why an efficiency core would have lower core-to-core latency. And if I want to build an SPSC queue with a reader and a writer on separate cores, should I expect 2 performance cores or 2 efficiency cores to give me higher throughput?
Intel E-cores come in clusters of 4 cores that share an L2 cache, and the coherency mechanism is optimized so everything can stay within that local L2 when 2 E-cores in the same cluster are sharing a line.
(e.g. a read-for-ownership after an L1d store or RMW miss on one E-core can get the line from L2 if it's in Modified or Exclusive state there, without having to wait for communication with farther-away caches.)
P-cores each have a private L2 and only share an L3, so their interconnect is the ring bus that connects the P-cores, E-core clusters, memory controllers, and the system agent (PCIe etc.). Each L3 slice is next to a core on the ring bus.
Your CPU only has 1 cluster of E cores; CPUs with more would have higher latency between E-cores in separate clusters.
https://chipsandcheese.com/p/examining-intels-arrow-lake-at-the discusses inter-core latency (and has some benchmarks) from a Core Ultra 9 285K (Arrow Lake desktop with multiple E-core clusters) and a Core Ultra 7 258V (Lunar Lake like yours with 4P4E). They found that surprisingly, inter-core latency is worse between P-cores than between different clusters of E-cores. The article speculates some on possible reasons.
Fun tidbit from that article: Skymont cores on Lunar Lake sit on a low-power island away from the ring bus, and saw high L2 miss latency even when the request was serviced from Lunar Lake's 8 MB memory-side cache. So E-to-P latency may be worse on your CPU than on Arrow Lake. Indeed, your results and Chester Lam's both show the worst case is E-to-P on that CPU, not P-to-P.
It makes sense given the cache hierarchy that communication between two E-cores in the same cluster is lower latency than having to go through L3 to get between P-cores, or from P to E. It is interesting, though, and a bit surprising how much faster it is. Your lscpu output shows this cache topology: cores 4-7 all share L2 cache 8, while cores 0-3 each have a private L2.
> say I want to build a SPSC queue with a reader and a writer on separate core, would I expect 2 performance core, or 2 efficiency core to give me a higher throughput?
Are your threads going to do anything besides hammer on the queue as fast as possible and wait for atomic operations? If your SPSC queue does its job and lets computation overlap with communication, with computation being the bottleneck, the P cores should be faster. Unless the higher latency there makes it so slow that the queue becomes a bottleneck where it wasn't on the E cores.
(Especially) If you're not doing useful work, it depends some on how you design your queue. If you can make it not very sensitive to round-trip latency, then bandwidth between cores or cache lines per cycle (or per ns since clock speed matters) might be the bigger concern.
For example in a garbage-collected language, maybe you can just use a linked-list so the producer thread doesn't usually have to modify any memory that the consumer is reading, if it stays ahead.
Even for an array-based circular buffer, if the writer can get several updates done while the reader is stalled waiting to read an atomic index, and vice versa, throughput is hopefully limited more by local throughput for hot cache lines than by round trips. Actually, for an SPSC queue you often don't need RMWs to be atomic at all: on a value where there are readers but you're the only writer, `tmp = write_pos.load(relaxed) + 1; write_pos.store(tmp, release);` is safe and good. (Avoiding `seq_cst` memory_order is essential for the store to be fast on x86, and for the load to be fast on AArch64 when there are earlier release stores.)
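Putting that together, a minimal SPSC ring-buffer sketch using exactly that single-writer load+store pattern (the class and names are mine, not from any particular library):

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal SPSC ring buffer. Capacity must be a power of 2 so that
// indices can grow monotonically and be masked on access.
// Only the producer writes write_pos; only the consumer writes read_pos,
// so each index needs only a plain load + store, no atomic RMW.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert((Capacity & (Capacity - 1)) == 0, "power of 2");
    T buf[Capacity];
    // Separate cache lines so the producer's stores to write_pos
    // don't false-share with the consumer's stores to read_pos.
    alignas(64) std::atomic<std::size_t> write_pos{0};
    alignas(64) std::atomic<std::size_t> read_pos{0};

public:
    bool push(const T& v) {                      // producer thread only
        std::size_t w = write_pos.load(std::memory_order_relaxed);
        if (w - read_pos.load(std::memory_order_acquire) == Capacity)
            return false;                        // full
        buf[w & (Capacity - 1)] = v;
        write_pos.store(w + 1, std::memory_order_release);  // publish
        return true;
    }

    std::optional<T> pop() {                     // consumer thread only
        std::size_t r = read_pos.load(std::memory_order_relaxed);
        if (r == write_pos.load(std::memory_order_acquire))
            return std::nullopt;                 // empty
        T v = buf[r & (Capacity - 1)];
        read_pos.store(r + 1, std::memory_order_release);
        return v;
    }
};
```

The release store on `write_pos` is what makes the element write visible before the index update; the consumer's acquire load pairs with it. No `lock`-prefixed instruction is ever needed on x86 for either side.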