llama.cpp

A C++ inference engine for running large language models on consumer hardware. The foundational library underlying most local AI tools: Ollama, Jan, GPT4All, and AnythingLLM all use llama.cpp under the hood.

GGUF Format

llama.cpp uses GGUF (GPT-Generated Unified Format) — a single-file format combining model weights and metadata. Benefits:

Fast loading and swapping between models
Enables quantization (storing weights at lower precision)
Portable: one file contains everything needed to run the model

Quantization

Models are typically released at 32-bit or 16-bit precision (high accuracy, large RAM footprint). llama.cpp supports quantizing models down to lower precision:

Precision	RAM Usage	Quality
16-bit	Full (baseline)	Full
4-bit	~25% of 16-bit	Near-comparable

The Q4_K_M naming convention: Q = quantized, 4 = 4-bit precision, K_M = the specific compression algorithm variant tuned for quality. Quantization saves ~75% on hardware requirements and improves inference throughput.

Platform Kernels

Optimized backends for nearly every platform:

Metal — Apple Silicon (Mac)
CUDA — NVIDIA GPU
ROCm — AMD GPU
Vulkan — cross-platform GPU
CPU — any hardware (slower but universal)

Usage Modes

CLI (terminal chat):

llama-cli --model model.gguf

Local OpenAI-compatible server:

llama-server --model model.gguf --port 8080

Exposes GET and POST endpoints on port 8080 — drop-in replacement for OpenAI API. Compatible with LangChain, LangGraph, and any tool that accepts an OpenAI-compatible endpoint.

Ecosystem

Supports multimodal models (image description, vision)
MCP integration for connecting external services
Powers Ollama, Jan, GPT4All, AnythingLLM
TurboQuant is a KV cache optimization that targets llama.cpp (in the process of merging upstream)

Open-Source Model Support

Supports any model released in GGUF format — DeepSeek, Llama, Qwen, and most other open-weight models available on Hugging Face.

AI For Dev

Explorer

llama-cpp

llama.cpp

GGUF Format

Quantization

Platform Kernels

Usage Modes

Ecosystem

Open-Source Model Support

See Also

Graph View

Table of Contents

Backlinks