Tags: python, machine-learning, neural-network, cpu, half-precision-float

Can language model inference on a CPU save memory by quantizing?


For example, according to https://cocktailpeanut.github.io/dalai/#/ the relevant figures for LLaMA-65B are:

  • The full model won't fit in memory on even a high-end desktop computer.

  • The quantized one would (though it still wouldn't fit in the video memory of even a $2000 Nvidia graphics card).

However, CPUs don't generally support arithmetic on anything smaller than fp32. And sure enough, when I've tried running Bloom 3B and 7B on a machine without a GPU, the memory consumption appeared to be about 12 GB and 28 GB respectively.
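(Those figures are consistent with roughly 4 bytes per parameter. A minimal sketch of how one might check this on a CPU-only machine, assuming the Hugging Face transformers and PyTorch APIs:

    import torch
    from transformers import AutoModelForCausalLM

    # With no dtype specified, from_pretrained materialises the weights in
    # torch.float32 on CPU, i.e. 4 bytes per parameter.
    model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-3b")

    # Parameters + buffers, in bytes; for ~3B parameters this comes to
    # roughly 12 GB, matching the observation above.
    print(f"{model.get_memory_footprint() / 1e9:.1f} GB")

)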

Is there a way to gain the memory savings of quantization when running the model on a CPU?


Solution

  • Okay, I finally got LLaMA-7B running on CPU and measured it: the fp16 version takes 14 GB and the fp32 version takes 28 GB. This is on an old CPU without AVX-512, so presumably it's expanding the format when reading values into cache or registers, but either way, yes, the memory saving is realized; see the sketch below for loading the weights in half precision.
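
A minimal sketch of loading in half precision on CPU, assuming the Hugging Face transformers and PyTorch APIs (the checkpoint identifier below is illustrative, not something from the original post; substitute whichever copy of the weights you actually have):

    import torch
    from transformers import AutoModelForCausalLM

    # Illustrative checkpoint name: replace with the repo id or local path
    # of your LLaMA-7B (or Bloom) weights.
    model_id = "huggyllama/llama-7b"

    # torch_dtype=torch.float16 keeps the parameters in fp16 in RAM:
    # ~7B parameters * 2 bytes/param ≈ 14 GB, versus ~28 GB for fp32.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    print(f"{model.get_memory_footprint() / 1e9:.1f} GB")

    # Individual values still get widened to fp32 in cache/registers during
    # compute (older CPUs have no native fp16 arithmetic), but the resident
    # weight memory stays at the fp16 size.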