Which LLM inference server should I use?

There is no single best server; it depends on your hardware and throughput. Use vLLM for high-concurrency production on NVIDIA GPUs, SGLang for agentic pipelines and repeated prompts, llama.cpp for CPU, edge and quantised models, and Ollama for developer workstations and getting started.

What is the difference between vLLM and Ollama?

vLLM is built for high-throughput serving on NVIDIA GPUs using PagedAttention and continuous batching, suited to many concurrent users. Ollama is built for developer workstations and getting started, bundling model management and a background server in one command with no ops knowledge required.

Do I have to pick just one inference server?

No. Different workloads suit different engines, and a proxy such as Olla can route across several backends at once, so you can run vLLM for production GPU serving while keeping llama.cpp or Ollama elsewhere, all behind a single endpoint.

LLM Inference Servers Compared - vLLM, SGLang, llama.cpp and Ollama

Choosing an inference server#

There is no single best LLM inference server. The right choice depends on the hardware you are running on, the throughput you need, and how much operational complexity you are willing to carry. This guide covers the four most widely used options - vLLM, SGLang, llama.cpp, and Ollama - so you can make that call with a clear picture of each.

If you are setting up self-hosted inference for the first time, the breadth of options can feel like a distraction from the actual work. All four projects are genuinely good at what they are built for, and all four have OpenAI-compatible APIs, some even have native Anthropic support. The differences are about fit, not quality. Getting the fit wrong means either leaving performance on the table or running something far heavier than the workload warrants.

For the broader case for self-hosting vs. cloud APIs, see our guide to deploying LLMs on your own infrastructure.

Which engine fits your situation

vLLM

High-concurrency production on NVIDIA GPU hardware - PagedAttention and continuous batching make the most of available VRAM under load.

SGLang

Agentic pipelines, repeated prompts, and VLMs - RadixAttention reuses shared prefixes so you avoid recomputing the same context on every request.

llama.cpp

CPU and edge inference, quantised models - runs on nearly any hardware including Apple Silicon and consumer GPUs with minimal dependencies.

Ollama

Developer workstations and getting started - model management and a background server in one command, with no ops knowledge required.

vLLM#

vLLM was built from the ground up for high-throughput serving on NVIDIA GPUs, and it shows. The two mechanisms that define it are PagedAttention - a memory management technique that treats the GPU's KV cache like virtual memory, reducing fragmentation and enabling higher batch occupancy - and continuous batching, which lets the server slot new requests into active batches rather than waiting for an entire batch to complete.

The result is that vLLM handles bursty, high-concurrency workloads well. When you have many users hitting the same model at the same time, vLLM's scheduler makes better use of the available GPU memory than a naive queue would.

Where it shines:

Production serving of large models on multi-GPU NVIDIA hardware
High concurrency - tens to hundreds of simultaneous requests
Multi-GPU tensor-parallel and pipeline-parallel setups
Speculative decoding and prefix caching

What it costs you: vLLM is not a light-weight thing to run. It requires a CUDA-capable GPU with meaningful VRAM, has a non-trivial startup time as it compiles CUDA kernels on first run, and its configuration surface is broad. If you are on a CPU, Apple Silicon, or a small consumer GPU, vLLM is not the right starting point.

SGLang#

SGLang is the inference server that matters most if your workloads involve structured output, constrained decoding, or multi-step agentic pipelines. Its core innovation is RadixAttention, which uses a radix tree to aggressively reuse the KV cache across requests that share a common prefix - a system prompt that is identical for thousands of requests, for example, only needs to be computed once.

That prefix-reuse characteristic makes SGLang particularly strong for agentic and RAG workflows where every request starts with a large, stable context. It also has first-class support for vision-language models (VLMs) and structured output formats, including JSON schema enforcement and regular expression constraints.

Where it shines:

Workloads with repeated, long system prompts (RAG, agents, tool-use pipelines)
Structured output at scale
VLM serving
High throughput on GPU, competitive with vLLM on many benchmarks

What it costs you: Like vLLM, SGLang is a GPU-first server. It requires Python, CUDA, and a reasonably modern NVIDIA GPU. The project moves fast, which is a strength, though it means the documentation and configuration API change more frequently than the more mature alternatives.

llama.cpp#

llama.cpp is the project that made running large language models on commodity hardware realistic. Written in C and C++, with minimal dependencies, it runs on nearly anything: x86 CPUs, ARM CPUs, Apple Silicon via Metal, NVIDIA GPUs via CUDA, AMD GPUs via ROCm, Vulkan-capable hardware, and combinations of the above.

Its native format is GGUF, a single-file model format that bundles weights, metadata, and tokeniser into one portable file. Quantisation is first-class: you can run a 7B model in 4-bit quantisation on 6 GB of RAM, or a 70B model across CPU and GPU memory using partial offloading. The model library at Hugging Face is full of GGUF-format models ready to use.

llama.cpp ships a server mode (llama-server) with an OpenAI-compatible API, so it is not just a library for other tools to embed - it is a complete inference backend in its own right.

Where it shines:

CPU inference with no GPU requirement
Apple Silicon (M1/M2/M3/M4) via Metal acceleration
Consumer GPUs where you need quantised models to fit in VRAM
Edge and embedded deployments
Low RAM budget environments
Portability across operating systems

What it costs you: llama.cpp does not scale horizontally the way vLLM and SGLang do. It is a single-node server. Throughput under high concurrency is lower than the GPU-optimised servers. If you are serving production traffic with many concurrent users and have the GPU hardware to match, you will outgrow llama.cpp's server mode fairly quickly.

Ollama#

Ollama is llama.cpp with a model management layer, a clean CLI, and a local API server that starts automatically in the background. The value it adds is operational: you pull a model with one command, it is cached locally, and a process is already listening on port 11434 ready to serve it.

# Pull and run a model - that is the entire workflow
ollama pull gemma4:4b
ollama run gemma4:4b

The API follows the OpenAI /v1/chat/completions and /v1/models conventions, which means anything that speaks to OpenAI will speak to Ollama with a one-line endpoint change. Ollama runs on macOS, Linux, and Windows, and on Apple Silicon it uses Metal for GPU acceleration automatically.

Where it shines:

Local development and prototyping
Getting started with self-hosted inference in under 5 minutes
macOS and Windows workstations where you want a managed background process
Teams where developers need a standard local setup without ops knowledge
Running models on Apple Silicon

What it costs you: Ollama's simplicity is also its ceiling. The concurrency model, configuration surface, and performance tuning options are narrower than vLLM or SGLang. It is not the right tool when you need to serve production traffic to many concurrent users or need fine-grained control over batching and memory.

Also in the field#

The four above cover the most common deployments, but the ecosystem is wider.

HuggingFace TGI (Text Generation Inference) was an influential early production server. As of March 2026, the repository is archived and TGI is in maintenance mode, with new development paused. Existing deployments continue to work, but it is no longer the recommended starting point for new projects.

TensorRT-LLM (NVIDIA) delivers the highest throughput numbers on NVIDIA hardware by compiling models directly to TensorRT engine format. The performance ceiling is real, but so is the operational complexity: it requires NVIDIA hardware at every stage, model-specific build steps, and a deeper understanding of the NVIDIA stack than most teams want to carry early on.

LMDeploy (from the Shanghai AI Lab team) is a strong choice for production serving on NVIDIA hardware, particularly for quantised models and long-context workloads. Its TurboMind engine has good benchmark results and it is worth evaluating if you need an alternative to vLLM.

MLX / mlx-lm is Apple's framework for fast inference on Apple Silicon. mlx-lm provides text generation and fine-tuning directly on M-series hardware, and is the recommended path for native Apple Silicon serving outside of Ollama and llama.cpp.

oMLX is a macOS-native inference server built on top of mlx-lm (not to be confused with Apple's MLX framework itself). Its standout feature is paged SSD KV caching: cache blocks are persisted to disk in safetensors format with a two-tier RAM/SSD architecture, so previously cached prefixes survive context shifts and server restarts without recomputation. That cuts time-to-first-token on long contexts from 30-90 seconds down to under 5 seconds, which matters most for coding agents that repeatedly invalidate the KV cache. It exposes both OpenAI-compatible (/v1/chat/completions) and Anthropic-compatible (/v1/messages) endpoints, supports continuous batching, multi-model serving (LLM, VLM, embedding, reranker), and ships as a signed, notarised macOS menu bar app. Apache 2.0, Apple Silicon only, requires macOS 15+. GitHub: jundot/omlx.

vLLM on Apple Silicon (sometimes called vLLM-MLX in community discussions) is vLLM's Apple Silicon support path - useful if you want a vLLM-compatible API surface on Mac hardware consistent with your GPU servers.

Lemonade / LemonadeSDK is a community-built local LLM server, sponsored and optimised by AMD, with particular attention to Ryzen AI and Radeon hardware. It exposes OpenAI, Anthropic, and Ollama-compatible APIs.

Docker Model Runner is worth a slightly fuller note because it fits a different mental model than the others. Rather than running a separate inference service, it brings model serving into the standard Docker workflow - you pull and run models as OCI artifacts with docker model run, and Docker exposes an OpenAI-compatible endpoint automatically. For teams already standardised on Docker, this removes the "extra service to operate" friction entirely and makes LLM serving feel like any other containerised workload.

LiteLLM appears on lists like this often, but it is more of a unified API proxy and gateway than an inference engine - it does not run models itself, it routes requests to backends that do. If you are evaluating it as an alternative to a proxy like Olla, see our Olla vs LiteLLM comparison.

Our tools work with all of these

Whichever backend you land on, you do not have to bet on just one. Olla supports all of the engines covered here - Ollama, vLLM, SGLang, llama.cpp, TensorRT-LLM, LMDeploy, MLX/mlx-lm, Lemonade, Docker Model Runner, LiteLLM, and more - as do FoundryOS and Alloy. We continue to add native support for new inference platforms as they mature, so your application layer stays stable while the backend landscape evolves.

Comparison table#

Engine	Primary use	Hardware target	Concurrency strength	Ease of operation	Quantisation	API compatibility	Best for
vLLM	Production serving	NVIDIA GPU (multi-GPU capable)	High - PagedAttention + continuous batching	Moderate - GPU setup, CUDA deps	GPTQ, AWQ, FP8	OpenAI `/v1`	High-concurrency production on NVIDIA
SGLang	Agentic / structured output	NVIDIA GPU	High - RadixAttention prefix reuse	Moderate - fast-moving project	GPTQ, AWQ, FP8	OpenAI `/v1`	Agentic pipelines, repeated prompts, VLMs
llama.cpp	Portability and low-resource	CPU, Apple Silicon, consumer GPU, edge	Low-moderate - single node	Low - minimal deps, static binary	GGUF (2-bit to 8-bit)	OpenAI `/v1` (via llama-server)	CPU/edge inference, quantised models
Ollama	Local dev UX	macOS, Linux, Windows, Apple Silicon	Low-moderate - wraps llama.cpp	Very low - one-command setup	GGUF (via llama.cpp)	OpenAI `/v1` + Ollama-native	Developer workstations, getting started
oMLX	Apple Silicon local inference with persistent KV cache	Apple Silicon only (macOS 15+)	Moderate - continuous batching via mlx-lm	Low - native macOS app, one-click setup	MLX-format models (4-bit, 8-bit; not GGUF)	OpenAI `/v1` + Anthropic `/v1/messages`	Mac-native serving for coding agents and long-context workloads

You do not have to pick just one#

Most teams end up running more than one backend

In practice, most teams end up with more than one inference backend running at the same time. The GPU box in the office runs vLLM and handles the heavier models. Developer workstations run Ollama. A Mac mini running llama.cpp serves as a fallback when the GPU box is at capacity. That is not a mess - it is a reasonable use of the hardware you have.

The operational problem that emerges is that every application now has to know about all of these backends, their different addresses, and their different model naming conventions. When one goes down, the application breaks. When you add a new backend, you update code.

This is exactly the problem an LLM proxy solves. A proxy sits in front of all your backends and gives applications a single endpoint. It handles health checking, failover, and model-aware routing, so your code never needs to care which backend is currently serving a request. For a full explanation of how this works, see What is an LLM proxy?.

Olla is our open-source proxy (Apache 2.0, written in Go) that supports all four engines covered here - Ollama, vLLM, SGLang, and llama.cpp - plus LM Studio, LiteLLM, LMDeploy, vLLM-MLX, Docker Model Runner, and others. It uses roughly 50 MB of RAM at rest. You point it at your backends, and your applications talk to one address regardless of what is running behind it.

If your hardware budget is the deciding factor across all of this, the self-hosted vs. cloud API cost comparison gives a practical breakdown of where self-hosting pays off.

You have NVIDIA GPU hardware and expect concurrent users: start with vLLM. It is the most mature production server for this case and its PagedAttention makes good use of GPU memory under load.
Your workloads are agentic or involve long repeated system prompts: SGLang's prefix caching pays off here. Also worth evaluating for VLM serving.
You are on CPU, Apple Silicon, or a small consumer GPU: llama.cpp is the right engine. GGUF quantisation gives you models that actually fit in your available memory.
You need something running on a developer workstation in under 10 minutes: Ollama. It handles the model management and process lifecycle so you do not have to.
You are running more than one backend: put a proxy in front of them. One endpoint, health checking, failover, and model-aware routing - without changing your application code.

LLM Inference Servers Compared - vLLM, SGLang, llama.cpp and Ollama

Choosing an inference server#

Which engine fits your situation

vLLM

SGLang

llama.cpp

Ollama

vLLM#

SGLang#

llama.cpp#

Ollama#

Also in the field#

Our tools work with all of these

Comparison table#

You do not have to pick just one#

Most teams end up running more than one backend

Key takeaways

Further reading#