LLM Inference Servers Compared - vLLM, SGLang, llama.cpp and Ollama
Choosing an inference server#
If you are setting up self-hosted inference for the first time, the breadth of options can feel like a distraction from the actual work. All four projects are genuinely good at what they are built for, and all four have OpenAI-compatible APIs, some even have native Anthropic support. The differences are about fit, not quality. Getting the fit wrong means either leaving performance on the table or running something far heavier than the workload warrants.
For the broader case for self-hosting vs. cloud APIs, see our guide to deploying LLMs on your own infrastructure.
Which engine fits your situation
vLLM
SGLang
llama.cpp
Ollama
vLLM#
vLLM was built from the ground up for high-throughput serving on NVIDIA GPUs, and it shows. The two mechanisms that define it are PagedAttention - a memory management technique that treats the GPU's KV cache like virtual memory, reducing fragmentation and enabling higher batch occupancy - and continuous batching, which lets the server slot new requests into active batches rather than waiting for an entire batch to complete.
The result is that vLLM handles bursty, high-concurrency workloads well. When you have many users hitting the same model at the same time, vLLM's scheduler makes better use of the available GPU memory than a naive queue would.
Where it shines:
- Production serving of large models on multi-GPU NVIDIA hardware
- High concurrency - tens to hundreds of simultaneous requests
- Multi-GPU tensor-parallel and pipeline-parallel setups
- Speculative decoding and prefix caching
What it costs you: vLLM is not a light-weight thing to run. It requires a CUDA-capable GPU with meaningful VRAM, has a non-trivial startup time as it compiles CUDA kernels on first run, and its configuration surface is broad. If you are on a CPU, Apple Silicon, or a small consumer GPU, vLLM is not the right starting point.
SGLang#
SGLang is the inference server that matters most if your workloads involve structured output, constrained decoding, or multi-step agentic pipelines. Its core innovation is RadixAttention, which uses a radix tree to aggressively reuse the KV cache across requests that share a common prefix - a system prompt that is identical for thousands of requests, for example, only needs to be computed once.
That prefix-reuse characteristic makes SGLang particularly strong for agentic and RAG workflows where every request starts with a large, stable context. It also has first-class support for vision-language models (VLMs) and structured output formats, including JSON schema enforcement and regular expression constraints.
Where it shines:
- Workloads with repeated, long system prompts (RAG, agents, tool-use pipelines)
- Structured output at scale
- VLM serving
- High throughput on GPU, competitive with vLLM on many benchmarks
What it costs you: Like vLLM, SGLang is a GPU-first server. It requires Python, CUDA, and a reasonably modern NVIDIA GPU. The project moves fast, which is a strength, though it means the documentation and configuration API change more frequently than the more mature alternatives.
llama.cpp#
llama.cpp is the project that made running large language models on commodity hardware realistic. Written in C and C++, with minimal dependencies, it runs on nearly anything: x86 CPUs, ARM CPUs, Apple Silicon via Metal, NVIDIA GPUs via CUDA, AMD GPUs via ROCm, Vulkan-capable hardware, and combinations of the above.
Its native format is GGUF, a single-file model format that bundles weights, metadata, and tokeniser into one portable file. Quantisation is first-class: you can run a 7B model in 4-bit quantisation on 6 GB of RAM, or a 70B model across CPU and GPU memory using partial offloading. The model library at Hugging Face is full of GGUF-format models ready to use.
llama.cpp ships a server mode (llama-server) with an OpenAI-compatible API, so it is not just a library for other tools to embed - it is a complete inference backend in its own right.
Where it shines:
- CPU inference with no GPU requirement
- Apple Silicon (M1/M2/M3/M4) via Metal acceleration
- Consumer GPUs where you need quantised models to fit in VRAM
- Edge and embedded deployments
- Low RAM budget environments
- Portability across operating systems
What it costs you: llama.cpp does not scale horizontally the way vLLM and SGLang do. It is a single-node server. Throughput under high concurrency is lower than the GPU-optimised servers. If you are serving production traffic with many concurrent users and have the GPU hardware to match, you will outgrow llama.cpp's server mode fairly quickly.
Ollama#
Ollama is llama.cpp with a model management layer, a clean CLI, and a local API server that starts automatically in the background. The value it adds is operational: you pull a model with one command, it is cached locally, and a process is already listening on port 11434 ready to serve it.
# Pull and run a model - that is the entire workflow
ollama pull gemma4:4b
ollama run gemma4:4b The API follows the OpenAI /v1/chat/completions and /v1/models conventions, which means anything that speaks to OpenAI will speak to Ollama with a one-line endpoint change. Ollama runs on macOS, Linux, and Windows, and on Apple Silicon it uses Metal for GPU acceleration automatically.
Where it shines:
- Local development and prototyping
- Getting started with self-hosted inference in under 5 minutes
- macOS and Windows workstations where you want a managed background process
- Teams where developers need a standard local setup without ops knowledge
- Running models on Apple Silicon
What it costs you: Ollama's simplicity is also its ceiling. The concurrency model, configuration surface, and performance tuning options are narrower than vLLM or SGLang. It is not the right tool when you need to serve production traffic to many concurrent users or need fine-grained control over batching and memory.
Also in the field#
The four above cover the most common deployments, but the ecosystem is wider.
HuggingFace TGI (Text Generation Inference) was an influential early production server. As of March 2026, the repository is archived and TGI is in maintenance mode, with new development paused. Existing deployments continue to work, but it is no longer the recommended starting point for new projects.
TensorRT-LLM (NVIDIA) delivers the highest throughput numbers on NVIDIA hardware by compiling models directly to TensorRT engine format. The performance ceiling is real, but so is the operational complexity: it requires NVIDIA hardware at every stage, model-specific build steps, and a deeper understanding of the NVIDIA stack than most teams want to carry early on.
LMDeploy (from the Shanghai AI Lab team) is a strong choice for production serving on NVIDIA hardware, particularly for quantised models and long-context workloads. Its TurboMind engine has good benchmark results and it is worth evaluating if you need an alternative to vLLM.
MLX / mlx-lm is Apple's framework for fast inference on Apple Silicon. mlx-lm provides text generation and fine-tuning directly on M-series hardware, and is the recommended path for native Apple Silicon serving outside of Ollama and llama.cpp.
oMLX is a macOS-native inference server built on top of mlx-lm (not to be confused with Apple's MLX framework itself). Its standout feature is paged SSD KV caching: cache blocks are persisted to disk in safetensors format with a two-tier RAM/SSD architecture, so previously cached prefixes survive context shifts and server restarts without recomputation. That cuts time-to-first-token on long contexts from 30-90 seconds down to under 5 seconds, which matters most for coding agents that repeatedly invalidate the KV cache. It exposes both OpenAI-compatible (/v1/chat/completions) and Anthropic-compatible (/v1/messages) endpoints, supports continuous batching, multi-model serving (LLM, VLM, embedding, reranker), and ships as a signed, notarised macOS menu bar app. Apache 2.0, Apple Silicon only, requires macOS 15+. GitHub: jundot/omlx.
vLLM on Apple Silicon (sometimes called vLLM-MLX in community discussions) is vLLM's Apple Silicon support path - useful if you want a vLLM-compatible API surface on Mac hardware consistent with your GPU servers.
Lemonade / LemonadeSDK is a community-built local LLM server, sponsored and optimised by AMD, with particular attention to Ryzen AI and Radeon hardware. It exposes OpenAI, Anthropic, and Ollama-compatible APIs.
Docker Model Runner is worth a slightly fuller note because it fits a different mental model than the others. Rather than running a separate inference service, it brings model serving into the standard Docker workflow - you pull and run models as OCI artifacts with docker model run, and Docker exposes an OpenAI-compatible endpoint automatically. For teams already standardised on Docker, this removes the "extra service to operate" friction entirely and makes LLM serving feel like any other containerised workload.
LiteLLM appears on lists like this often, but it is more of a unified API proxy and gateway than an inference engine - it does not run models itself, it routes requests to backends that do. If you are evaluating it as an alternative to a proxy like Olla, see our Olla vs LiteLLM comparison.
Our tools work with all of these
Whichever backend you land on, you do not have to bet on just one. Olla supports all of the engines covered here - Ollama, vLLM, SGLang, llama.cpp, TensorRT-LLM, LMDeploy, MLX/mlx-lm, Lemonade, Docker Model Runner, LiteLLM, and more - as do FoundryOS and Alloy. We continue to add native support for new inference platforms as they mature, so your application layer stays stable while the backend landscape evolves.
Comparison table#
| Engine | Primary use | Hardware target | Concurrency strength | Ease of operation | Quantisation | API compatibility | Best for |
|---|---|---|---|---|---|---|---|
| vLLM | Production serving | NVIDIA GPU (multi-GPU capable) | High - PagedAttention + continuous batching | Moderate - GPU setup, CUDA deps | GPTQ, AWQ, FP8 | OpenAI /v1 | High-concurrency production on NVIDIA |
| SGLang | Agentic / structured output | NVIDIA GPU | High - RadixAttention prefix reuse | Moderate - fast-moving project | GPTQ, AWQ, FP8 | OpenAI /v1 | Agentic pipelines, repeated prompts, VLMs |
| llama.cpp | Portability and low-resource | CPU, Apple Silicon, consumer GPU, edge | Low-moderate - single node | Low - minimal deps, static binary | GGUF (2-bit to 8-bit) | OpenAI /v1 (via llama-server) | CPU/edge inference, quantised models |
| Ollama | Local dev UX | macOS, Linux, Windows, Apple Silicon | Low-moderate - wraps llama.cpp | Very low - one-command setup | GGUF (via llama.cpp) | OpenAI /v1 + Ollama-native | Developer workstations, getting started |
| oMLX | Apple Silicon local inference with persistent KV cache | Apple Silicon only (macOS 15+) | Moderate - continuous batching via mlx-lm | Low - native macOS app, one-click setup | MLX-format models (4-bit, 8-bit; not GGUF) | OpenAI /v1 + Anthropic /v1/messages | Mac-native serving for coding agents and long-context workloads |
You do not have to pick just one#
Most teams end up running more than one backend
In practice, most teams end up with more than one inference backend running at the same time. The GPU box in the office runs vLLM and handles the heavier models. Developer workstations run Ollama. A Mac mini running llama.cpp serves as a fallback when the GPU box is at capacity. That is not a mess - it is a reasonable use of the hardware you have.
The operational problem that emerges is that every application now has to know about all of these backends, their different addresses, and their different model naming conventions. When one goes down, the application breaks. When you add a new backend, you update code.
This is exactly the problem an LLM proxy solves. A proxy sits in front of all your backends and gives applications a single endpoint. It handles health checking, failover, and model-aware routing, so your code never needs to care which backend is currently serving a request. For a full explanation of how this works, see What is an LLM proxy?.
Olla is our open-source proxy (Apache 2.0, written in Go) that supports all four engines covered here - Ollama, vLLM, SGLang, and llama.cpp - plus LM Studio, LiteLLM, LMDeploy, vLLM-MLX, Docker Model Runner, and others. It uses roughly 50 MB of RAM at rest. You point it at your backends, and your applications talk to one address regardless of what is running behind it.
If your hardware budget is the deciding factor across all of this, the self-hosted vs. cloud API cost comparison gives a practical breakdown of where self-hosting pays off.
Key takeaways
- You have NVIDIA GPU hardware and expect concurrent users: start with vLLM. It is the most mature production server for this case and its PagedAttention makes good use of GPU memory under load.
- Your workloads are agentic or involve long repeated system prompts: SGLang's prefix caching pays off here. Also worth evaluating for VLM serving.
- You are on CPU, Apple Silicon, or a small consumer GPU: llama.cpp is the right engine. GGUF quantisation gives you models that actually fit in your available memory.
- You need something running on a developer workstation in under 10 minutes: Ollama. It handles the model management and process lifecycle so you do not have to.
- You are running more than one backend: put a proxy in front of them. One endpoint, health checking, failover, and model-aware routing - without changing your application code.
Further reading#
- vLLM documentation - production GPU inference server
- SGLang on GitHub - fast LLM and VLM serving with RadixAttention
- llama.cpp on GitHub - portable C/C++ inference with GGUF
- Ollama - local model management and inference
- Olla on GitHub - open-source LLM proxy for all of the above
- What is an LLM proxy? - how a proxy sits in front of multiple backends
- Deploying LLMs on your own infrastructure - the full self-hosting guide
- Self-hosted LLM vs cloud API cost - when self-hosting makes financial sense