My understanding of PAE and paging is as follows; it may be incomplete, so I may be wrong. The AMD64 system programming manual has this picture:
It says that a 40-bit physical address can be used without PAE but with PSE enabled, and a 52-bit physical address with PAE (and optionally PSE). That's OK.
Later, in the PTE format with PAE enabled, I can see that the physical page base address field was extended from 20 bits to 40 bits, so 52-bit physical addresses are possible (a 40-bit page frame number plus a 12-bit page offset).
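As a quick sanity check of that arithmetic, here is a minimal Python sketch. The names `PAGE_SHIFT`, `max_phys_bits` and `phys_addr` are made up for illustration and do not come from the manual; the sketch assumes 4 KiB pages, so the low 12 bits of a physical address are the page offset.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, so 12 offset bits


def max_phys_bits(pfn_bits, page_shift=PAGE_SHIFT):
    """Physical address width reachable with a frame number of pfn_bits bits."""
    return pfn_bits + page_shift


def phys_addr(pfn, offset, page_shift=PAGE_SHIFT):
    """Combine a page frame number and an in-page offset into one address."""
    if not 0 <= offset < (1 << page_shift):
        raise ValueError("offset must fit inside one page")
    return (pfn << page_shift) | offset


# 20-bit field (non-PAE PTE) versus 40-bit field (PAE PTE)
print(max_phys_bits(20), max_phys_bits(40))
```

The largest 40-bit frame number combined with the largest offset gives the top of the 52-bit space, i.e. `phys_addr((1 << 40) - 1, 0xFFF) == (1 << 52) - 1`.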
Then I took a look at CR3 register with and without PAE enabled:
Without PAE, it has 20 bits for the page directory table base address, which means we can point to any 4 KiB page within the 4 GiB address space:
// max page frame number
2^20 - 1 = 1048575
// PAGE_SIZE = 4096
1048576 * 4096 = 4 GiB
So the PDT base address can be anywhere within 4 GiB, which looks like plenty. Even if we are going to address a 40-bit physical address space, we can theoretically hold all page tables in memory, since with PSE enabled the page table size is 4 KiB:
// max address space
2^40 = 1 TiB
// how many page tables we need
// to address ALL 2^40 physical address space
1 TiB / 4 GiB = 256
// we need 1 MiB of memory
// to be able to store ALL page tables
// with PSE enabled to address 2^40
// physical address space.
//
// Since PDE is 4 KiB aligned memory space
// the page table size will be exactly 4 KiB
256 * 4 KiB = 1 MiB
Once PAE is enabled, it doesn't matter whether PSE is used or not; we can address the full 52-bit physical address space. The only point where I'm stuck is that the PDT base address in the CR3 register was extended from 20 to 27 bits:
// max physical address space
2^52 = 4 PiB
// max number of page tables
4 PiB / 4 GiB = 1048576
// without PSE page table size is about 8 KiB
1048576 * 8 KiB = 8 GiB
// with PSE enabled page table size is about 4 KiB
1048576 * 4 KiB = 4 GiB
So, theoretically, we need about 8 GiB of memory to store all page tables with PAE enabled and no PSE, and about 4 GiB with PSE enabled. This is only theoretical, and nobody in those years could use a 4 PiB physical address space, so extending the PDT base address in CR3 to 27 bits looks like a waste with no real use, since nobody stores ALL page tables 1:1 in production systems (maybe I'm wrong). To me this looks weird.
So, after all, my question is the same as in the title -- "what is the reason to extend page-directory-pointer-table base address in cr3 from 20 to 27 bits in AMD64 legacy mode (PAE)?"
In legacy mode with PAE, translating 32-bit virtual addresses takes 3 levels instead of the traditional 2, since PAE translates only 9 bits per level with 512x 64-bit PTEs, instead of 10 with 1024x 32-bit PTEs filling a 4K page.
Two levels of PAE translation cover 12 + 9 + 9 = 30 bits (the 12-bit page offset plus 9 bits per level), meaning only 2 more virtual-address bits need to be translated by the top level, the table of PDPEs that CR3 points to. This top-level table only needs 4 entries, thus 32 bytes, not a full 4096-byte page.
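To make the 2 + 9 + 9 + 12 split concrete, here is a small Python sketch that pulls a 32-bit virtual address apart the way legacy-mode PAE with 4 KiB pages does. The function name `split_pae_va` is hypothetical and only for illustration; large (2 MiB) pages are not modelled.

```python
def split_pae_va(va):
    """Split a 32-bit virtual address for legacy-mode PAE with 4 KiB pages.

    Returns (pdpt_index, pd_index, pt_index, offset).
    """
    if not 0 <= va < (1 << 32):
        raise ValueError("legacy-mode virtual addresses are 32 bits wide")
    offset = va & 0xFFF              # bits 11:0, byte within the 4 KiB page
    pt_index = (va >> 12) & 0x1FF    # bits 20:12, one of 512 PTEs
    pd_index = (va >> 21) & 0x1FF    # bits 29:21, one of 512 PDEs
    pdpt_index = (va >> 30) & 0x3    # bits 31:30, one of only 4 PDPEs
    return pdpt_index, pd_index, pt_index, offset


print(split_pae_va(0xC0201003))
```

Since `pdpt_index` can only be 0 to 3, the top-level table holds 4 entries of 8 bytes each, which is where the 32-byte size comes from.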
So in legacy-mode PAE, CR3 points to a 32-byte (naturally-aligned) mini table of PDPEs. It doesn't have to start at the beginning of a 4K page, so CR3 needs more address bits at the low end for finer granularity.
log2(32) = 5, and 32 - 5 = 27 bits to address any 32-byte chunk of physical memory in the low 4 GiB (32-bit physical addresses).
(The PDPEs need to be in the low 4 GiB of physical address space so CR3 can point to them. PDEs and PTEs can be anywhere because they're only pointed to by 8-byte PDPEs and PDEs respectively.)
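The field-width arithmetic can be checked with a short Python sketch. The helper names `base_field_bits` and `pdpt_base_from_cr3` are made up for this example, and the mask assumes the legacy-mode PAE layout described above, with the table base in CR3 bits 31:5.

```python
def base_field_bits(phys_addr_bits, table_alignment):
    """Bits needed to name any naturally-aligned table in the address range."""
    if table_alignment <= 0 or table_alignment & (table_alignment - 1):
        raise ValueError("alignment must be a power of two")
    # log2 of a power of two is its bit_length minus one
    return phys_addr_bits - (table_alignment.bit_length() - 1)


def pdpt_base_from_cr3(cr3):
    """Physical address of the 32-byte PDPE table (assumes CR3 bits 31:5)."""
    return cr3 & 0xFFFFFFE0


print(base_field_bits(32, 4096))  # non-PAE: 4 KiB-aligned page directory
print(base_field_bits(32, 32))    # legacy PAE: 32-byte-aligned PDPE table
```

So the extra 7 bits come from the finer 32-byte alignment of the table, not from a wider physical address: the base still fits in 32 bits.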