For example, according to https://cocktailpeanut.github.io/dalai/#/, the relevant figures for LLaMA-65B are:
The full model won't fit in memory on even a high-end desktop computer.
The quantized one would. (Though it would still not fit in video memory on even a $2000 Nvidia graphics card.)
However, CPUs don't generally support arithmetic on anything smaller than fp32. And when I've tried running Bloom 3B and 7B on a machine without a GPU, sure enough, the memory consumption has come out at roughly 12 and 28 GB respectively, i.e. about 4 bytes per parameter.
Is there a way to get the memory savings of quantization when running the model on a CPU?
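To make the question concrete, here is a minimal sketch of the kind of thing I'm asking about: loading a checkpoint on CPU with the weights kept in fp16 and watching resident memory. It assumes the Hugging Face transformers API plus psutil, and the checkpoint name is just an example, not necessarily what I ran.

```python
# Sketch: load a causal LM on CPU with fp16 weights and check resident memory.
# Assumes transformers, torch and psutil are installed.
import os
import psutil
import torch
from transformers import AutoModelForCausalLM

def rss_gb():
    """Resident set size of this process, in GiB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024**3

print(f"before load: {rss_gb():.1f} GiB")

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-3b",      # example checkpoint; substitute your own
    torch_dtype=torch.float16,  # keep the weights in fp16 rather than fp32
    low_cpu_mem_usage=True,     # avoid materialising a second full copy while loading
)

print(f"after load:  {rss_gb():.1f} GiB")
```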
Okay, I finally got LLaMA-7B running on CPU and measured: the fp16 version takes 14 GB, the fp32 version 28 GB. This is on an old CPU without AVX-512, so presumably it's widening fp16 to fp32 as values are loaded into cache or registers, but either way, yes, the memory saving is real.
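To illustrate what I mean by widening the format: the general idea is to keep the big tensors in fp16 and convert to fp32 only at compute time. The toy sketch below shows the principle; it's just my guess at what the framework is doing, not its actual internals.

```python
import torch

# A 4096x4096 weight matrix stored in fp16: 2 bytes per element,
# half the RAM of the fp32 equivalent.
weight = torch.randn(4096, 4096, dtype=torch.float16)
x = torch.randn(1, 4096)  # fp32 activation

# Widen to fp32 only for the matmul; the fp32 copy is a temporary.
y = x @ weight.float()

print(y.dtype)                # torch.float32
print(weight.element_size())  # 2 bytes per stored weight
```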