vLLM

Open-source LLM inference engine optimized for high-throughput parallel serving on Nvidia GPUs. Where Ollama and LM Studio are designed around single-request local workflows, vLLM is built to saturate a GPU with hundreds of concurrent requests — making it the practical choice for code completion, multi-user serving, and any workload that benefits from batching.

  • GitHub: vllm-project/vllm
  • License: Open source (Apache 2.0)
  • Best fit: Nvidia GPUs (RTX 40-series, 50-series, Blackwell, Pro 6000); also AMD Instinct cards. Consumer AMD cards work with caveats.

Why It Matters

Alex Ziskind benchmarked the same Qwen 3 Coder 30B model across runtimes on the same hardware (RTX PRO 6000). Throughput numbers from his demo:

Runtime               Concurrent users   Tokens/sec
LM Studio             1                  ~80
LM Studio             4                  ~80 (queues, doesn’t parallelize)
llama.cpp Bench       1                  ~78
Docker Model Runner   4                  ~88 (some parallelism)
vLLM (Docker)         4                  298
vLLM (Docker)         256                5,800–6,000

The pattern: LM Studio cannot scale beyond a single request even though it wraps llama.cpp. Docker Model Runner offers only limited parallelism. vLLM is the only option that fully saturates the GPU.

Why Code Completion Specifically Needs This

Chat is one request at a time — single-thread throughput is fine. Code completion is different: every keystroke can fire a new completion request, and a single developer can have multiple completion requests in flight simultaneously. Without parallelism, requests queue up and latency explodes.

“When you’re doing code completion, it’s sending tons of data to your provider… GPU stays saturated, queuing drops down, and latency drops down as well.” — Alex Ziskind

This is the practical reason vLLM matters even for single-developer use cases — you are not the only client; your editor is constantly asking the model for completions in the background.
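A back-of-the-envelope model makes the queuing effect concrete. The request count and per-request cost below are illustrative assumptions, not measurements from the demo:

```python
# Illustrative queuing arithmetic: a burst of completion requests, each
# taking `service_s` seconds of GPU time when run alone.

def last_request_latency(n_requests: int, service_s: float, parallel: bool) -> float:
    """Latency seen by the last request in a burst.

    Serial engine: requests queue, so the last one waits for all others.
    Parallel engine (idealized batching): the burst shares the GPU and
    finishes together with little added latency.
    """
    if parallel:
        return service_s           # idealized: the batch absorbs the burst
    return n_requests * service_s  # serial: latency grows with queue depth

# A burst of 8 in-flight completions at 0.5 s each:
serial = last_request_latency(8, 0.5, parallel=False)   # 4.0 s for the last request
batched = last_request_latency(8, 0.5, parallel=True)   # 0.5 s for the last request
```

The serial case is why latency "explodes" under a completion workload: queue depth multiplies directly into tail latency.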

Setup via Docker

vLLM is harder to configure from a bare install than Ollama, but the Docker image makes it portable across Nvidia GPUs:

# SSH into the GPU host, then launch the OpenAI-compatible server.
# (-p 8000:8000 publishes vLLM's default API port; --max-num-seqs caps
# how many sequences are batched concurrently.)
docker run --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model <model-card-or-path> \
  --quantization fp8 \
  --max-num-seqs 256

The same image works on RTX 40, 50, Pro 6000, Blackwell — anywhere with a CUDA-capable GPU. Pair with FP8 on Blackwell tensor cores for the maximum-throughput configuration.
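Once the container is up, the server speaks the standard OpenAI completions API. A minimal stdlib client sketch — the port (8000) and the model name are assumptions about your deployment; the model string must match whatever was passed to `--model`:

```python
import json
import urllib.request

def completion_request(prompt: str, model: str,
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build an OpenAI-style /v1/completions request for a vLLM server."""
    payload = {
        "model": model,        # must match the --model flag used at launch
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = completion_request("def fib(n):", model="my-coder-model")
# urllib.request.urlopen(req) would send it; the response JSON carries
# the completion text under choices[0].
```

Because the API surface is OpenAI-compatible, any editor plugin that accepts a custom base URL can point at this endpoint directly.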

Quantization Support

vLLM supports the full quantization stack relevant to Nvidia hardware:

  • BF16 — baseline, requires datacenter GPU
  • Q8 / Int8 — common, broad GPU support
  • FP8 — floating-point 8-bit, native to Blackwell tensor cores, the format Alex demos
  • FP4 — even faster, Alex teases a separate video

Of these, FP8 is the sweet spot on Nvidia Blackwell — significantly faster than Int8 with better precision retention. The Qwen 3 Coder 30B FP8 build is what produces the 5,800+ tok/s figure.
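The formats trade precision for memory and bandwidth. Rough weight-only footprints for a 30B-parameter model (ignoring KV cache and activation overhead — these are textbook byte counts, not measured numbers) show why lower-bit formats also raise throughput on bandwidth-bound decoding:

```python
PARAMS = 30e9  # ~30B parameters (Qwen 3 Coder 30B class)

BYTES_PER_PARAM = {
    "BF16": 2.0,  # 16-bit baseline
    "Int8": 1.0,  # 8-bit integer
    "FP8":  1.0,  # 8-bit float: same size as Int8, better precision retention
    "FP4":  0.5,  # 4-bit float
}

for fmt, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{fmt:>4}: ~{gb:.0f} GB of weights")
```

Halving the bytes per parameter roughly halves the memory traffic per decoded token, which is where much of the FP8/FP4 speedup comes from.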

How It Compares

                        vLLM                              Ollama                  LM Studio               Docker Model Runner    llama.cpp
Parallelism             Yes (full GPU saturation)         Limited                 No                      Modest                 Tooling-dependent
OpenAI-compatible API   Yes                               Yes                     Yes                     Yes                    Yes (server mode)
Setup complexity        Medium-high (Docker fixes most)   Low                     Lowest                  Low                    Medium
Mac support             No (Nvidia-focused)               Yes                     Yes                     Yes (limited GPU)      Yes
Best for                Production / multi-user /         Standalone local chat   Standalone local chat   Container-native dev   Custom inference
                        code completion

The decision tree:

  • Local chat, one user, Mac? → Ollama or LM Studio
  • Container-native dev, modest concurrency? → Docker Model Runner
  • Code completion, serious throughput, Nvidia GPU? → vLLM
  • Multi-user serving in production? → vLLM
  • Tinkering with custom inference primitives? → llama.cpp directly
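The tree above can be sketched as a lookup table (the use-case labels mirror the bullets and are illustrative, not an official taxonomy):

```python
def pick_runtime(use_case: str) -> str:
    """Map the decision-tree bullets to a runtime choice."""
    table = {
        "local chat, one user, Mac": "Ollama or LM Studio",
        "container-native dev, modest concurrency": "Docker Model Runner",
        "code completion, serious throughput, Nvidia GPU": "vLLM",
        "multi-user serving in production": "vLLM",
        "tinkering with custom inference primitives": "llama.cpp",
    }
    # Default to vLLM when throughput is the open question.
    return table.get(use_case, "vLLM")
```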

Relationship to the Benchmarks Section

vLLM is what unlocks the top-tier scores for the Lenovo P8 Threadripper Proxmox host with RTX PRO 6000 Blackwell Max-Q in our hardware benchmarks. The Geekbench AI numbers reflect single-request throughput; vLLM with FP8 adds an additional ~70x improvement on concurrent workloads. This is the practical answer to “why pay $9,145 for the PRO 6000” — it’s not the single-request speed, it’s the parallelism + FP8 combination.
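The ~70x figure falls straight out of the demo numbers above: the LM Studio single-request baseline against vLLM’s FP8 aggregate at 256 concurrent users.

```python
baseline_tps = 80                 # LM Studio, 1 user (~80 tok/s in the demo)
vllm_low, vllm_high = 5800, 6000  # vLLM FP8 aggregate range at 256 users

speedup_low = vllm_low / baseline_tps    # 72.5x
speedup_high = vllm_high / baseline_tps  # 75.0x
```

Note this is aggregate throughput across all concurrent requests, not a 70x speedup of any single request.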

See Also