large-language-model · huggingface · quantization · llamacpp

How to quantize an HF safetensors model and save it in llama.cpp GGUF format with quantization below q8_0?


I'm developing LLM agents using llama.cpp as the inference engine. Sometimes I want to use models distributed in safetensors format, and there is a Python script (https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py) to convert them.

The script is great, but the smallest quantization it produces is 8-bit (q8_0). Is there another script or repo that supports lower-bit quantization formats?


Solution

  • The first step is to convert the Hugging Face model to GGUF (F16 or F32 output is recommended) using convert_hf_to_gguf.py from the llama.cpp repository.

    The second step is to run the compiled C++ quantize tool from the /examples/quantize/ subdirectory of llama.cpp (https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize), which supports lower-bit formats such as Q4_K_M.

    The process is straightforward and well documented; a minimal command sketch follows.
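As a sketch, assuming a recent llama.cpp checkout (where the quantize binary is built as llama-quantize; older builds name it just quantize) and placeholder paths, the two steps look like this:

```bash
# Step 1: convert the safetensors model to a full-precision GGUF.
# f16 keeps the intermediate file half the size of f32 with negligible loss.
python convert_hf_to_gguf.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf

# Build llama.cpp if you have not already (produces build/bin/llama-quantize).
cmake -B build && cmake --build build --config Release

# Step 2: quantize the F16 GGUF down to 4-bit; Q4_K_M is a common
# quality/size trade-off, but any supported type name works here.
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```

Running llama-quantize with no arguments prints its usage text, including the full list of supported quantization types, so you can pick whichever bit width you need.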