As far as I know, L1 is VIPT on at least Intel chips. VIVT caches don't depend on address translation at all, so they can operate fully in parallel with the TLB lookup. VIPT can also achieve some parallelism because the set index doesn't involve physical address bits, but once the set is selected, the tag comparison for way selection has to wait for the TLB result (unless there's a way predictor). So there is at least some dependence on the TLB, whereas a VIVT lookup is fully independent of it.
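To make that concrete, here is a toy C sketch of which address bits a VIPT lookup uses where; the geometry (32K, 8-way, 64-byte lines, 4K pages, i.e. a typical Intel L1d) and the example address are just illustrative assumptions:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed toy geometry: 32K cache, 8 ways, 64-byte lines, 4K pages.
     * 32K / (8 * 64) = 64 sets -> 6 index bits, occupying address bits [6..11]. */
    enum { LINE_BITS = 6, SET_BITS = 6, PAGE_OFFSET_BITS = 12 };

    int main(void)
    {
        uint64_t vaddr = 0x00007f12345678c0ULL;   /* example virtual address */

        /* The set index comes from bits [6..11] of the virtual address.  They
         * lie entirely inside the 4K page offset, so they're identical in the
         * physical address: set selection can start before the TLB answers. */
        unsigned set = (unsigned)(vaddr >> LINE_BITS) & ((1u << SET_BITS) - 1);

        /* The tag compare is against physical bits [12 and up], which only the
         * TLB can supply -- that is the part of the lookup that still waits on
         * translation (or on a way predictor). */
        printf("set index %u chosen from page-offset bits, before translation\n", set);
        return 0;
    }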
The main issue with VIVT is that memory mappings change over time, and the cache (or part of it) must be invalidated every time that happens. But isn't that also true for VIPT? As per this article:
In general, when Linux is changing an existing virtual->physical mapping to a new value, the sequence will be in one of the following forms:

1) flush_cache_mm(mm);
   change_all_page_tables_of(mm);
   flush_tlb_mm(mm);

2) flush_cache_range(vma, start, end);
   change_range_of_page_tables(mm, start, end);
   flush_tlb_range(vma, start, end);

3) flush_cache_page(vma, addr, pfn);
   set_pte(pte_pointer, new_pte_val);
   flush_tlb_page(vma, addr);
Partial or full cache invalidation is the first step in all three cases. Switching to a different process definitely changes virtual->physical mappings, so I'd assume the kernel performs the flush then as well. If I understand correctly, even when mappings change within the same process, Linux will invalidate (part of) L1. With that invalidation in the middle of the remap sequence, I think the issue with VIVT is resolved; and since the invalidation is performed anyway, aren't the benefits of VIPT nullified?
Intel (and other x86 vendors) don't need to invalidate L1 caches when changing page tables; their L1 caches behave equivalently to PIPT, just faster.
Intel has always built their L1 VIPT caches with the index bits coming purely from the offset-within-page part of the address, which is identical in the virtual and physical addresses. So you get the speed advantage of VIPT, but in every other way it behaves as PIPT, with no aliasing.
This property is achieved by making the cache small enough and associative enough that a single way is 4K or less, i.e. no bigger than a page. For example: 16K 4-way in Pentium III, 32K 8-way for many years, 48K 12-way in Ice Lake. 16K/4 = 4K, 32K/8 = 4K, 48K/12 = 4K.
Except in their Silvermont-family E-cores: Gracemont, for example, has a 64K 8-way L1i, but still a 32K 8-way L1d.
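As a quick sanity check of that arithmetic, here is a small standalone C program (a 4K page size is assumed; the entries are just the figures quoted above) that computes the size of one way and flags whether the index spills past the page offset:

    #include <stdio.h>

    /* A VIPT cache can only alias if one way (size / ways) is bigger than a
     * page, because only then do index bits extend past the page offset.
     * Geometries below are the ones mentioned in the text; 4K pages assumed. */
    int main(void)
    {
        const struct { const char *name; unsigned size_kb, ways; } caches[] = {
            { "Pentium III L1d: 16K 4-way",   16, 4 },
            { "typical Intel L1d: 32K 8-way", 32, 8 },
            { "Ice Lake L1d: 48K 12-way",     48, 12 },
            { "Gracemont L1i: 64K 8-way",     64, 8 },
        };
        const unsigned page_kb = 4;

        for (unsigned i = 0; i < sizeof caches / sizeof caches[0]; i++) {
            unsigned way_kb = caches[i].size_kb / caches[i].ways;
            printf("%-30s one way = %2uK -> %s\n", caches[i].name, way_kb,
                   way_kb <= page_kb ? "index fits in the page offset (no aliasing)"
                                     : "index spills into the page number (can alias)");
        }
        return 0;
    }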
AMD has sometimes used larger, less-associative caches, especially for instruction caches. They use way prediction and micro-tags; see How is AMD's micro-tagged L1 data cache accessed?
Zen 1 used a 64K 4-way L1i, so each way is 16K: 4x the largest way size possible without using index bits from the page-number part of the address. (Anandtech's Zen article has a table of L1i/d cache size and associativity for various uarches up to Zen 1.)
Zen 2 changed to 32K each for L1i and L1d.
With the tag overlapping the index (e.g. by those 2 bits for Zen 1's L1i), checking for hits can still avoid false positives after a remap, since those physical bits are part of the tag. For a read-only cache like Zen 1's L1i, an extra miss just costs performance; there's no correctness problem, since there isn't dirty data you need to notice. See https://www.phoronix.com/review/amd_bulldozer_aliasing re: such performance issues. (And my answer on Performance implications of aliasing in VIPT cache.)
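To spell out the aliasing arithmetic for Zen 1's L1i (numbers taken from above; nothing AMD-specific like the micro-tags is modeled here), a brief sketch:

    #include <stdio.h>

    int main(void)
    {
        /* Zen 1 L1i: 64K, 4-way, with 4K pages (figures from the text). */
        unsigned way_bytes  = 64 * 1024 / 4;   /* 16K per way */
        unsigned page_bytes = 4 * 1024;

        /* Index bits above the page offset: log2(16K / 4K) = 2. */
        unsigned overlap_bits = 0;
        for (unsigned w = way_bytes / page_bytes; w > 1; w >>= 1)
            overlap_bits++;

        /* Those 2 index bits come from the virtual page number, so one physical
         * line can land in any of 2^2 = 4 sets.  Because the same 2 physical
         * bits are also stored in the tag, a lookup that picks the "wrong" set
         * after a remap just misses -- it can never falsely hit. */
        printf("overlap bits: %u, possible sets per physical line: %u\n",
               overlap_bits, 1u << overlap_bits);
        return 0;
    }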
The kernel certainly doesn't have to do anything manually beyond changing the page tables and invalidating the TLB for the affected page(s); if more were required, kernels written for the 386 (which had no on-chip cache and no instructions to manage one) wouldn't have worked on later CPUs. But backwards compatibility of new hardware with old software has been x86's defining feature and a key part of its commercial success in its early years.
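On the kernel side, this is roughly why the flush_cache_* hooks quoted in the question cost nothing on x86: an architecture that doesn't need them just doesn't define them, and empty generic fallbacks kick in. The snippet below is a paraphrased sketch of that pattern (as found in include/asm-generic/cacheflush.h), not a verbatim copy of kernel code:

    /* Paraphrased sketch: architectures whose caches behave like PIPT (x86
     * included) don't provide their own flush_cache_* implementations, so
     * these empty fallbacks are used and the calls compile away to nothing. */
    struct mm_struct;            /* real definitions live elsewhere in the kernel */
    struct vm_area_struct;

    #ifndef flush_cache_mm
    static inline void flush_cache_mm(struct mm_struct *mm) { }
    #endif

    #ifndef flush_cache_range
    static inline void flush_cache_range(struct vm_area_struct *vma,
                                         unsigned long start, unsigned long end) { }
    #endif

    #ifndef flush_cache_page
    static inline void flush_cache_page(struct vm_area_struct *vma,
                                        unsigned long vmaddr, unsigned long pfn) { }
    #endif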
So for example the uop cache in Sandybridge-family must invalidate itself when the mapping for a page changes, because it is virtually addressed.