I'm developing LLM agents using llama.cpp as the inference engine. Sometimes I want to use models in safetensors format, and there is a Python script (https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py) to convert them.
The script is awesome, but the smallest number format it produces is 8-bit (q8_0). Is there any other script or repo that supports smaller quantization formats?
The first step is to convert the Hugging Face model to GGUF (16-bit or 32-bit float output is recommended) using convert_hf_to_gguf.py from the llama.cpp repository.
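For example, a minimal invocation could look like the following (the model directory and output file name are placeholders; `--outtype f16` selects 16-bit float output):

```bash
# Convert a Hugging Face checkpoint directory to a 16-bit float GGUF.
# ./my-model-dir and my-model-f16.gguf are example names; adjust to your setup.
python convert_hf_to_gguf.py ./my-model-dir \
    --outfile my-model-f16.gguf \
    --outtype f16
```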
The second step is to use the compiled C++ quantize tool from the /examples/quantize/ subdirectory of llama.cpp (https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize).
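Once llama.cpp is built, the quantizer is a small command-line tool that takes the input GGUF, an output path, and the target quantization type. A sketch, reusing the placeholder file names from above (depending on your llama.cpp version the binary is named quantize or llama-quantize; run it with no arguments to list every supported type):

```bash
# Quantize the 16-bit float GGUF down to 4-bit (Q4_K_M).
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```

Q4_K_M is a common middle ground between size and quality, but the same command works with any of the listed quantization types.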
The process is pretty straightforward and well-documented.