I imagine CPUs have to have features that allow it to communicate and work with the GPU, and I can imagine this exists today, but in the early days of GPUs, how did companies get support from large CPU companies to have their devices be supported, and what features did CPU companies add to enable this?
You mean special support beyond just being devices on a bus like PCI? (Or even older, ISA or VLB.)
TL:DR: All the special features CPUs have which are useful for improved bandwidth to write (and sometimes read) video memory came after 3D graphics cards were commercially successful. They weren't necessary, just a performance boost.
Once GPUs were commercially successful and popular, and a necessary part of a gaming PC, it made obvious sense for CPU vendors to add features to make things better.
The same IO busses that let you plug in a sound card or network card already have the capabilities to access device memory and MMIO, and device IO ports, which is all that's necessary for video drivers to make a graphics card do things.
Modern GPUs are often the highest-bandwidth devices in a system (especially non-servers), so they benefit from fast buses, hence AGP for a while, until PCI Express (PCIe) unified everything again.
Anyway, graphics cards could work on standard busses; it was only once 3D graphics became popular and commercially important (and fast enough for the PCI bus to be a bottleneck), that things needed to change. At that point, CPU / motherboard companies were fully aware that consumers cared about 3D games, and thus it would make sense to develop a new bus specifically for graphics cards.
(Along with a GART, graphics address/aperture remapping table, an IOMMU that made it much easier / safer for drivers to let an AGP or PCIe video card read directly from system memory. Including I think with addresses under control of user-space, without letting user-space read arbitrary system memory, thanks to it being an IOMMU that only allows a certain address range.)
Before the GART was a thing, I assume drivers for PCI GPUs needed to have the host CPU initiate DMA to the device. Or if bus-master DMA by the GPU did happen, it could read any byte of physical memory in the system if it wanted, so drivers would have to be careful not to let programs pass arbitrary pointers.
Anyway, having a GART was new with AGP, which post-dates early 3D graphics cards like 3dfx's Voodoo and ATI 3D Rage. I don't know enough details to be sure I'm accurately describing the functionality a GART enables.
So most of the support for GPUs was in terms of busses, and thus a chipset thing, not CPUs proper. (Back then, CPUs didn't have integrated memory controllers, instead just talking to the chipset northbridge over a frontside bus.)
Relevant CPU instructions included Intel's SSE and SSE2 instruction sets, which had streaming (NT = non-temporal) stores which are good for storing large amounts of data that won't be re-read by the CPU any time soon, if at all.
SSE4.1 in 2nd-gen Core2 (2008 ish) added a streaming load instruction (movntdqa
) which (still) only does anything special if used on memory regions marked in the CPU's page tables or MTRR as WC (aka USWC: uncacheable, write-combining). Copying back from GPU memory to the host was the intended use-case. (Non-temporal loads and the hardware prefetcher, do they work together?)
x86 CPUs introducing the MTRR (Memory Type Range Register) is another feature that improved CPU -> GPU write bandwidth. Again, this came after 3D graphics were commercially successful for gaming.