I'm trying to understand the RIDL class of vulnerability.
This is a class of vulnerabilities that is able to read stale data from various micro-architectural buffers.
Today the known vulnerabilities exploits: the LFBs, the load ports, the eMC and the store buffer.
The paper linked is mainly focused on LFBs.
I don't understand why the CPU would satisfy a load with the stale data in an LFB.
I can imagine that if a load hits in L1d it is internally "replayed" until the L1d brings data into an LFB signalling the OoO core to stop "replaying" it (since the data read are now valid).
However I'm not sure what "replay" actually mean.
I thought loads were dispatched to a load capable port and then recorded in the Load Buffer (in the MOB) and there being eventually hold as needed until their data is available (as signalled by the L1).
So I'm not sure how "replaying" comes into play, furthermore for the RIDL to work, each attempt to "play" a load should also unblock dependent instructions.
This seems weird to me as the CPU would need to keep track of which instructions to replay after the load correctly completes.
The paper on RIDL use this code as an example (unfortunately I had to paste it as an image since the PDF layout didn't allow me to copy it):
The only reason it could work is if the CPU will first satisfy the load at line 6 with a stale data and then replay it.
This seems confirmed few lines below:
Specifically, we may expect two accesses to be fast, not just the one corresponding to the leaked information. After all, when the processor discovers its mistake and restarts at Line 6 with the right value, the program will also access the buffer with this index.
But I would expect the CPU to check the address of the load before forwarding the data in the LFB (or any other internal buffer).
Unless the CPU actually executes the load repeatedly until it detect the data loaded is now valid (i.e. replaying).
But, again, why each attempt would unblock dependent instructions?
How does exactly the replaying mechanism work, if it even exists, and how this interacts with the RIDL vulnerabilities?
I don't think load replays from the RS are involved in the RIDL attacks. So instead of explaining what load replays are (@Peter's answer is a good starting point for that), I'll discuss what I think is happening based on my understanding of the information provided in the RIDL paper, Intel's analysis of these vulnerabilities, and relevant patents.
Line fill buffers are hardware structures in the L1D cache used to hold memory requests that miss in the cache and I/O requests until they get serviced. A cacheable request is serviced when the required cache line is filled into the L1D data array. A write-combining write is serviced when the any of the conditions for evicting a write-combining buffer occur (as described in the manual). A UC or I/O request is serviced when it is sent to the L2 cache (which occurs as soon as possible).
Refer to Figure 4 of the RIDL paper. The experiment used to produce these results works as follows:
MFENCE
and there is an optional CLFLUSH
. It's not clear to me from the paper the order of CLFLUSH
with respect to the other two instructions, but it probably doesn't matter. MFENCE
serializes the cache line flushing operation to see what happens when every load misses in the cache. In addition, MFENCE
reduces contention between the two logical cores on the L1D ports, which improves the throughput of the attacker.It's not clear to me what the Y-axis in Figure 4 represents. My understanding is that it represents the number of lines from the covert channel that got fetched into the cache hierarchy (Line 10) per second, where the index of the line in the array is equal to the value written by the victim.
If the memory location is of the WB type, when the victim thread writes the known value to the memory location, the line will be filled into the L1D cache. If the memory location is of the WT type, when the victim thread writes the known value to the memory location, the line will not be filled into the L1D cache. However, on the first read from the line, it will be filled. So in both cases and without CLFLUSH
, most loads from the victim thread will hit in the cache.
When the cache line for a load request reaches the L1D cache, it gets written first in the LFB allocated for the request. The requested portion of the cache line can be directly supplied to the load buffer from the LFB without having to wait for the line to be filled in the cache. According to the description of the MFBDS vulnerability, under certain situations, stale data from previous requests may be forwarded to the load buffer to satisfy a load uop. In the WB and WT cases (without flushing), the victim's data is written into at most 2 different LFBs. The page walks from the attacker thread can easily overwrite the victim's data in the LFBs, after which the data will never be found in there by the attacker thread. All load requests that hit in the L1D cache don't go through the LFBs; there is a separate path for them, which is multiplexed with the path from the LFBs. Nonetheless, there are some cases where stale data (noise) from the LFBs is being speculatively forwarded to the attacker's logical core, which is probably from the page walks (and maybe interrupt handlers and hardware prefetchers).
It's interesting to note that the frequency of stale data forwarding in the WB and WT cases is much lower than in all of the other cases. This is could be explained by the fact that the victim's throughput is much higher in these cases and the experiment may terminate earlier.
In all other cases (WC, UC, and all types with flushing), every load misses in the cache and the data has to be fetched from main memory to the load buffer through the LFBs. The following sequence of events occur:
MFENCE
after every load, there can be at most one outstanding load in the LFB at any given cycle from the victim.If the attacker's load didn't fault/assist, the LFBs will receive a valid physical address from the MMU and all checks required for correctness are performed. That's why the load has to fault/assist.
The following quote from the paper discusses how to perform a RIDL attack in the same thread:
we perform the RIDL attack without SMT by writing values in our own thread and observing the values that we leak from the same thread. Figure3 shows that if we do not write the values (“no victim”), we leak only zeros, but with victim and attacker running in the same hardware thread (e.g., in a sandbox), we leak the secret value in almost all cases.
I think there are no privilege level changes in this experiment. The victim and the attacker run in the same OS thread on the same hardware thread. When returning from the victim to the attacker, there may still be some outstanding requests in the LFBs from (especially from stores). Note that in the RIDL paper, KPTI is enabled in all experiments (in contrast to the Fallout paper).
In addition to leaking data from LFBs, MLPDS shows that data can also be leaked from the load port buffers. These include the line-split buffers and the buffers used for loads larger than 8 bytes in size (which I think are needed when the size of the load uop is larger than the size of the load port, e.g., AVX 256b on SnB/IvB that occupy the port for 2 cycles).
The WB case (no flushing) from Figure 5 is also interesting. In this experiment, the victim thread writes 4 different values to 4 different cache lines instead of reading from the same cache line. The figure shows that, in the WB case, only the data written to the last cache line is leaked to the attacker. The explanation may depend on whether the cache lines are different in different iterations of the loop, which is unfortunately not clear in the paper. The paper says:
For WB without flushing, there is a signal only for the last cache line, which suggests that the CPU performs write combining in a single entry of the LFB before storing the data in the cache.
How can writes to different cache lines be combining in the same LFB before storing the data in the cache? That makes zero sense. An LFB can hold a single cache line and and a single physical address. It's just not possible to combine writes like that. What may be happening is that WB writes are being written in the LFBs allocated for their RFO requests. When the invalid physical address is transmitted to the LFBs for comparison, the data may always be provided from the LFB that was last allocated. This would explain why only the value written by the fourth store is leaked.
For information on MDS mitigations, see: What are the new MDS attacks, and how can they be mitigated?. My answer there only discusses mitigations based on the Intel microcode update (not the very interesting "software sequences").
The following figure shows the vulnerable structures that use data speculation.