---
name: gguf-quantization
description: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [GGUF, Quantization, llama.cpp, CPU Inference, Apple Silicon, Model Compression, Optimization]
dependencies: [llama-cpp-python>=0.2.0]
---

# GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp, enabling efficient inference on CPUs, Apple Silicon, and GPUs with flexible quantization options.
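
Each GGUF file is self-describing: key-value metadata (architecture, context length, tokenizer) and the full tensor table travel with the weights. As a quick illustration, here is a minimal inspection sketch assuming the `gguf` Python package that ships with llama.cpp's gguf-py (`pip install gguf`); field decoding is simplified here:

```python
# Minimal sketch: list a GGUF file's metadata keys and tensor layout.
# Assumes `pip install gguf` (the gguf-py package from llama.cpp).
from gguf import GGUFReader

reader = GGUFReader("model-q4_k_m.gguf")

# Key-value metadata: architecture, context length, tokenizer, etc.
for key in reader.fields:
    print(key)

# Tensor table: name, shape, and quantization type of each weight
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```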

## When to use GGUF

Use GGUF when:

- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU-only inference with no GPU required
- Needing flexible quantization (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)

Key advantages:

- **Universal hardware**: CPU, Apple Silicon, NVIDIA, and AMD support
- **No Python runtime**: pure C/C++ inference
- **Flexible quantization**: 2-8 bit with various methods (K-quants)
- **Ecosystem support**: LM Studio, Ollama, koboldcpp, and more
- **imatrix**: importance matrix for better low-bit quality

Consider an alternative instead:

- **AWQ/GPTQ**: maximum accuracy with calibration on NVIDIA GPUs
- **HQQ**: fast calibration-free quantization for HuggingFace models
- **bitsandbytes**: simple integration with the transformers library
- **TensorRT-LLM**: production NVIDIA deployment with maximum speed

## Quick start

### Installation

```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install the Python bindings (optional)
pip install llama-cpp-python
```

### Convert a model to GGUF

```bash
# Install the conversion requirements (from the llama.cpp repo)
pip install -r requirements.txt

# Convert a HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify the output type explicitly
python convert_hf_to_gguf.py ./path/to/model \
  --outfile model-f16.gguf \
  --outtype f16
```

### Quantize the model

```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with an importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

### Run inference

```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```

## Quantization types

### K-quant methods (recommended)

| Type | Bits | Size (7B) | Quality | Use case |
|------|------|-----------|---------|----------|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |

### Legacy methods

| Type | Description |
|------|-------------|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |

**Recommendation**: use K-quant methods (Q4_K_M, Q5_K_M) for the best quality/size ratio.
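
The sizes in the table follow from the effective bits per weight: file size ≈ parameters × bits / 8, plus some overhead because certain tensors (embeddings, output layer) are kept at higher precision. A rough back-of-the-envelope helper, illustrative only:

```python
# Rough GGUF size estimate: parameters * effective bits-per-weight / 8.
# Illustrative only -- real files run slightly larger because some
# tensors are stored at higher precision.
BITS_PER_WEIGHT = {
    "Q2_K": 2.5, "Q3_K_M": 3.3, "Q4_K_M": 4.5,
    "Q5_K_M": 5.5, "Q6_K": 6.0, "Q8_0": 8.0,
}

def estimate_gb(n_params: float, quant: str) -> float:
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for quant in ("Q4_K_M", "Q8_0"):
    print(f"7B @ {quant}: ~{estimate_gb(7e9, quant):.1f} GB")
# 7B @ Q4_K_M: ~3.9 GB (the table lists ~4.1 GB with overhead)
```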

## Conversion workflows

### Workflow 1: HuggingFace to GGUF

```bash
# 1. Download the model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
  --outfile llama-3.1-8b-f16.gguf \
  --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```

### Workflow 2: With importance matrix (better quality)

```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF

# 3. Generate the importance matrix
./llama-imatrix -m model-f16.gguf \
  -f calibration.txt \
  --chunk 512 \
  -o model.imatrix \
  -ngl 35  # GPU layers if available

# 4. Quantize with the imatrix
./llama-quantize --imatrix model.imatrix \
  model-f16.gguf \
  model-q4_k_m.gguf \
  Q4_K_M
```
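
A few hand-written sentences work for a quick experiment, but imatrix quality improves with diverse calibration text. A sketch of one way to assemble a larger calibration file, assuming the HuggingFace `datasets` package is installed (WikiText-2 is just one convenient source; mix in domain text for specialized models):

```python
# Sketch: build calibration.txt from WikiText-2.
# Assumes `pip install datasets`; any diverse text source works.
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

with open("calibration.txt", "w") as f:
    written = 0
    for row in ds:
        text = row["text"].strip()
        if len(text) > 200:   # skip headings and short fragments
            f.write(text + "\n")
            written += 1
        if written >= 500:    # a few hundred chunks is usually plenty
            break
```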

### Workflow 3: Multiple quantizations

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate the imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
  ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
  echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```

## Python usage

### llama-cpp-python

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8       # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```

### Chat completion

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```

### Streaming

```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens as they are generated
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```

## Server mode

### Start an OpenAI-compatible server

```bash
# Start the server
./llama-server -m model-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096

# Or with the Python bindings
python -m llama_cpp.server \
  --model model-q4_k_m.gguf \
  --n_gpu_layers 35 \
  --host 0.0.0.0 \
  --port 8080
```

### Use with the OpenAI client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```

## Hardware optimization

### Apple Silicon (Metal)

```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```

### NVIDIA CUDA

```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Target a specific GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```

### CPU optimization

```bash
# Build with AVX2/AVX512 (detected automatically)
make clean && make

# Run with an optimal thread count
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
# Python CPU configuration
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
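
The "match physical cores" advice can be automated. A small sketch, assuming `psutil` is installed (`os.cpu_count()` alone reports logical cores, which over-subscribes on hyper-threaded CPUs):

```python
import os

import psutil  # pip install psutil
from llama_cpp import Llama

# psutil distinguishes physical from logical cores; fall back to
# os.cpu_count() if it cannot tell
physical = psutil.cpu_count(logical=False) or os.cpu_count()

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_gpu_layers=0,      # CPU only
    n_threads=physical,  # physical cores give the best CPU throughput
)
```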

## Integration with tools

### Ollama

```bash
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}

{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create the Ollama model
ollama create mymodel -f Modelfile

# Run it
ollama run mymodel "Hello!"
```

### LM Studio

  1. Place GGUF file in ~/.cache/lm-studio/models/
  2. Open LM Studio and select the model
  3. Configure context length and GPU offload
  4. Start inference

### text-generation-webui

```bash
# Place the model in the models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with the llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```

## Best practices

1. **Use K-quants**: Q4_K_M offers the best quality/size balance
2. **Use imatrix**: always use an importance matrix for Q4 and below
3. **GPU offload**: offload as many layers as VRAM allows (see the sketch after this list)
4. **Context length**: start with 4096, increase if needed
5. **Thread count**: match physical CPU cores, not logical ones
6. **Batch size**: increase n_batch for faster prompt processing
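
A configuration sketch tying practices 3-6 together. The per-layer VRAM figure below is an assumption for a Q4_K_M 7B model, not an official formula; treat `n_gpu_layers` as a starting point and tune it while watching for out-of-memory errors:

```python
from llama_cpp import Llama

# ASSUMPTION: ~150 MB of VRAM per offloaded layer for a Q4_K_M 7B model
# (weights plus KV cache); measure and adjust for your own model.
VRAM_MB = 8 * 1024
layers_that_fit = VRAM_MB // 150

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_gpu_layers=min(layers_that_fit, 99),  # offload what fits (practice 3)
    n_ctx=4096,                             # start at 4096 (practice 4)
    n_threads=8,                            # physical cores (practice 5)
    n_batch=512,                            # faster prompt processing (practice 6)
)
```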

## Common issues

**Model loads slowly:**

```bash
# mmap is on by default for fast loading; make sure it isn't disabled
# (--no-mmap), and consider --mlock to keep the model resident in RAM
./llama-cli -m model.gguf --mlock
```

**Out of memory:**

```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # reduced from 35

# Or use a smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```

**Poor quality at low bits:**

```bash
# Always use an imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
