What is an LLM Proxy?
An LLM proxy is a single endpoint that sits between your applications and your inference backends. Think of it the same way you think of Nginx sitting in front of three web servers. Callers only know about one address. What happens behind that address is the proxy's problem.
Here's what we'll cover:
- The problem that drives you to need one
- What an LLM proxy actually does (routing, health checks, sticky sessions, observability)
- How it differs from a plain load balancer and from a "gateway"
- A concrete example using Olla
- When you genuinely do not need one
The problem it solves#
Most teams that run local inference don't stay on one backend for long. They end up with several:
- Ollama running on a developer workstation
- vLLM on the team's GPU box
- llama.cpp on someone's Mac
- LM Studio on a laptop used for demos
Each of these has its own address, its own quirks, and its own model naming conventions (gemma4:4b on Ollama, Qwen/Qwen3.6-8B-Instruct on vLLM, something else on llama.cpp). Your application hardcodes one of them. When that backend goes down, restarts, or gets replaced, everything breaks.
You also have no visibility into which backend is actually handling requests, how long inference is taking on each one, or whether a backend has started returning errors.
An LLM proxy solves all of this by giving you one front door.
Let the proxy worry about nodes coming up and down as you go about your LLM adventures.
What an LLM proxy actually does#
We think an LLM proxy should do a few important things.
Unified API and model discovery#
A well-built LLM proxy exposes a single OpenAI-compatible endpoint (and often an Anthropic-compatible one too) in front of your LLM backends. Your application sends requests to http://proxy:11400/v1/chat/completions - the same request it would send to the OpenAI API - and the proxy handles the backend-specific translation.
It also discovers which models are available across all registered backends and exposes them through a single /v1/models endpoint, so the models from every backend appear in one unified catalog.
You query one place and see everything.
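As a rough illustration of the discovery step, the sketch below merges the model lists reported by several backends into one catalog and shapes it like an OpenAI-style /v1/models response. The function names are hypothetical, and reusing the `owned_by` field to list backends is an assumption made for the example, not how any particular proxy does it.

```python
from collections import defaultdict


def merge_catalogs(backend_models: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each model name to the sorted list of backends that report it."""
    catalog: dict[str, list[str]] = defaultdict(list)
    for backend, models in backend_models.items():
        for model in models:
            catalog[model].append(backend)
    # Sort models and backends so the catalog is stable between refreshes.
    return {model: sorted(backends) for model, backends in sorted(catalog.items())}


def to_openai_model_list(catalog: dict[str, list[str]]) -> dict:
    """Shape the merged catalog like an OpenAI-style /v1/models response."""
    return {
        "object": "list",
        "data": [
            # Assumption: owned_by is reused here to show which backends serve the model.
            {"id": model, "object": "model", "owned_by": ",".join(backends)}
            for model, backends in catalog.items()
        ],
    }
```

The merged catalog doubles as a routing table: looking up a model name gives the set of backends that can serve it.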
Load balancing and routing#
With multiple backends available, the proxy can spread requests across them. Simple round-robin works fine for stateless requests. Priority-based routing lets you say "prefer the GPU box for heavy models, fall back to the CPU server if it's busy."
The proxy can also route by model: requests for gemma4:4b go to the backend that actually has it loaded, rather than broadcasting to all of them.
The same backends, three different ways to pick one. These are the most common approaches - but there are many more.
- Round-robin rotates through the backends, one request after another.
- Priority-based always prefers your best backend, and falls back only when it has to.
- Least-connections sends each request to whichever backend is least busy.
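The three strategies can be sketched in a few lines each. This is a minimal illustration, not any proxy's actual implementation: the `Backend` class and function names are hypothetical, and "lower priority number wins" is an assumption that follows the illustrative config later in this article.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Backend:
    name: str
    priority: int         # assumption: lower number = more preferred
    active: int = 0       # requests currently in flight
    healthy: bool = True


def healthy_only(backends: list[Backend]) -> list[Backend]:
    """Filter out unhealthy backends, failing loudly if none are left."""
    live = [b for b in backends if b.healthy]
    if not live:
        raise RuntimeError("no healthy backends available")
    return live


class RoundRobin:
    """Rotate through the healthy backends, one request after another."""

    def __init__(self) -> None:
        self._ticket = count()

    def pick(self, backends: list[Backend]) -> Backend:
        live = healthy_only(backends)
        return live[next(self._ticket) % len(live)]


def pick_priority(backends: list[Backend]) -> Backend:
    """Always prefer the best-ranked healthy backend."""
    return min(healthy_only(backends), key=lambda b: b.priority)


def pick_least_connections(backends: list[Backend]) -> Backend:
    """Send the request to whichever healthy backend is least busy."""
    return min(healthy_only(backends), key=lambda b: b.active)
```

All three share the same health filter, which is why failover falls out of the selection step for free: an unhealthy backend is simply never a candidate.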
Health checks and automatic failover#
This is where things get really useful. The proxy polls each backend on a configurable interval and marks it healthy or unhealthy. When a backend stops responding - the GPU machine reboots, the container crashes, the vLLM process OOMs - the proxy routes around it automatically.
From the application's perspective, nothing changed. The request succeeds, just via a different backend. You find out about the failure when you check your proxy dashboard, not when a user reports a 502.
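A minimal sketch of that polling loop is shown below. The `HealthTracker` class is hypothetical, the probe is passed in as a plain function so the example stays self-contained, and the "two consecutive failures" threshold is an assumption chosen for illustration rather than a recommended default.

```python
from typing import Callable, Iterable, Optional


class HealthTracker:
    """Mark a backend unhealthy after `threshold` consecutive failed probes."""

    def __init__(self, names: Iterable[str], probe: Callable[[str], bool], threshold: int = 2) -> None:
        self.probe = probe
        self.threshold = threshold
        self.failures = {name: 0 for name in names}

    def poll(self) -> None:
        """Run one round of probes; a real proxy would call this on a timer."""
        for name in self.failures:
            if self.probe(name):
                self.failures[name] = 0   # any success resets the counter
            else:
                self.failures[name] += 1

    def is_healthy(self, name: str) -> bool:
        return self.failures[name] < self.threshold

    def first_healthy(self, preferred_order: Iterable[str]) -> Optional[str]:
        """Failover: walk the preference list and skip unhealthy backends."""
        for name in preferred_order:
            if self.is_healthy(name):
                return name
        return None
```

Requiring more than one failed probe trades a slightly slower reaction for fewer false alarms when a backend is merely busy loading a model.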
Sticky sessions and KV-cache affinity#
Multi-turn conversations are the interesting case. Each time a user sends a message in a conversation, the inference backend has to process all previous context. Most backends maintain a KV (key-value) cache from previous requests: if the same backend handles turn two of a conversation, it can reuse the cached computation from turn one rather than recomputing the whole context from scratch.
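That reuse only happens if the same backend sees the conversation again. With plain round-robin, turn two may land on a backend with a cold cache, which then has to recompute the whole context. A sticky session policy solves this: the proxy hashes some identifier (a session ID header, or a hash of the message prefix) and routes subsequent requests in the same conversation to the same backend. The KV cache stays warm, and your latency on turn three, four, five stays low rather than spiking.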
This matters more as conversations get longer. For short exchanges it's a minor concern. For a coding assistant with a 20,000-token context, it's significant.
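One way to sketch this: derive a stable key from the conversation, then map that key to a backend with rendezvous (highest-random-weight) hashing. Both function names are hypothetical, and hashing only the first message is a simplifying assumption; a real proxy may hash a longer prefix or use a different scheme entirely.

```python
import hashlib
import json
from typing import Optional


def affinity_key(messages: list[dict], session_id: Optional[str] = None) -> str:
    """Prefer an explicit session ID; otherwise hash the conversation's first message."""
    if session_id:
        basis = session_id
    else:
        # The first message never changes as a conversation grows,
        # so every later turn produces the same key.
        basis = json.dumps(messages[:1], sort_keys=True)
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()


def sticky_backend(key: str, backends: list[str]) -> str:
    """Rendezvous hashing: score every backend against the key, highest score wins."""
    return max(
        backends,
        key=lambda name: hashlib.sha256(f"{key}|{name}".encode("utf-8")).digest(),
    )
```

Rendezvous hashing is used here because removing a backend only moves the conversations that were pinned to it; everyone else keeps their warm cache.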
Observability#
A good proxy tracks per-backend stats: request counts, error rates, latency percentiles, routing decisions. This gives you a clear picture of which backends are doing the work, which are slow, and which are failing silently.
Without this, you're flying blind. "The AI feels slow today" is very different from "the vLLM box has a p95 latency of 8 seconds and Ollama is handling 80% of traffic."
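To make those numbers concrete, here is a small sketch of per-backend bookkeeping: request counts, error rate, traffic share, and a nearest-rank p95. The `BackendStats` class is hypothetical and keeps every sample in memory, which is fine for illustration but not something to assume of a production proxy.

```python
from collections import defaultdict
from typing import Optional


class BackendStats:
    """Track latency samples and error counts per backend."""

    def __init__(self) -> None:
        self.latencies: dict[str, list[float]] = defaultdict(list)
        self.errors: dict[str, int] = defaultdict(int)

    def record(self, backend: str, latency_s: float, ok: bool = True) -> None:
        self.latencies[backend].append(latency_s)
        if not ok:
            self.errors[backend] += 1

    def p95(self, backend: str) -> Optional[float]:
        """Nearest-rank 95th percentile, or None if there are no samples yet."""
        samples = sorted(self.latencies.get(backend, []))
        if not samples:
            return None
        rank = (95 * len(samples) + 99) // 100   # ceil(0.95 * n) in integer math
        return samples[rank - 1]

    def error_rate(self, backend: str) -> float:
        total = len(self.latencies.get(backend, []))
        return self.errors.get(backend, 0) / total if total else 0.0

    def traffic_share(self, backend: str) -> float:
        """Fraction of all recorded requests that this backend handled."""
        total = sum(len(samples) for samples in self.latencies.values())
        return len(self.latencies.get(backend, [])) / total if total else 0.0
```

With these three numbers per backend, "the AI feels slow today" turns into a statement you can act on.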
Proxy vs. gateway vs. load balancer#
These terms get conflated. Here's a practical distinction:
A plain load balancer (HAProxy, Nginx upstream, AWS ALB) is backend-agnostic. It doesn't understand what's inside the request. It can route by headers or source IP, but it has no concept of model names, streaming responses, or inference semantics. You could put one in front of vLLM instances, but you'd lose model-aware routing and you'd have to implement model-aware health checks and model unification yourself.
An LLM proxy understands inference traffic. It knows about model names, OpenAI-compatible endpoints, streaming (server-sent events), KV cache affinity, and per-backend capabilities. It's a specialised reverse proxy, not a general one.
An LLM gateway typically adds a layer on top of that: authentication, per-key rate limiting, cost attribution, multi-tenant key management, budget enforcement. If you're running inference for multiple teams or customers and need to track and control who is using what, you want a gateway. Alloy is our gateway tier - it sits above the proxy layer and adds those controls. But most teams running a couple of local backends don't need that complexity yet; a proxy is the right starting point.
An orchestration platform is the step up again. Once you're past a handful of backends and into a fleet of inference nodes - many GPUs across racks or clusters, with scheduling, autoscaling, fleet health, and rollout management - you've moved from "route my requests" to "run my inference estate." That's a platform problem, not a proxy one. FoundryOS is our enterprise platform for exactly that: fleet orchestration and monitoring for inference at scale (currently in early access). Most teams won't need it for a long time, and that's fine - it's the far end of the same road.
The short version: a load balancer moves bytes, a proxy gives you one resilient endpoint, a gateway adds multi-tenant control, and a platform orchestrates a whole fleet. Start with a proxy and climb the ladder only when the problem in front of you actually demands it.
A concrete example with Olla#
Olla is our open-source project (Apache 2.0, written in Go) that implements this pattern. It uses roughly 50MB of RAM at rest, has sub-millisecond endpoint selection, and supports Ollama, LM Studio, vLLM, SGLang, llama.cpp, LiteLLM, LMDeploy, vLLM-MLX, Docker Model Runner, and LemonadeSDK as backends.
The idea is straightforward. You point Olla at your backends:
```yaml
# Illustrative config - see https://thushan.github.io/olla for exact syntax
endpoints:
  - name: workstation
    url: http://localhost:11434 # Ollama
    type: ollama
    priority: 2
  - name: gpu-box
    url: http://192.168.1.50:8000 # vLLM
    type: vllm
    priority: 1 # preferred
  - name: mac-studio
    url: http://192.168.1.60:1234 # LM Studio
    type: lmstudio
    priority: 3 # fallback
```

Your application then talks to Olla's single endpoint:

```bash
curl http://localhost:11400/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:4b",
    "messages": [{"role": "user", "content": "What is an LLM proxy?"}]
  }'
```

Olla selects the highest-priority healthy backend that has the model loaded, forwards the request, and streams the response back. If the gpu-box goes down, the next request goes to the workstation without any change to the calling application.
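The same request can be made from application code. The sketch below uses only the Python standard library and assumes the proxy address from the curl example above; the helper names `build_chat_request` and `ask` are hypothetical, and the response parsing assumes the usual OpenAI-style chat completion shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the proxy."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # Assumption: the reply follows the OpenAI chat completion shape.
    return reply["choices"][0]["message"]["content"]
```

Calling `ask("http://localhost:11400", "gemma4:4b", "What is an LLM proxy?")` needs a running proxy; the point is that the only backend-related detail in the application is that one base URL.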
For sticky sessions, Olla uses a prefix hash on the message history to route multi-turn conversations to a consistent backend. The full details are in the Olla Sticky Sessions documentation.
To be clear, Olla is one example of this pattern, not the only one. If you have a single backend and a single user, you don't need a proxy.
When you do not need one#
A proxy adds a hop and a configuration surface. For a lot of setups, that is not worth it:
- Single backend, single user. You have Ollama on your laptop and you're the only person using it. Just point your tool at Ollama directly.
- Hobby or experimental setup. You're trying out models, not running a service. A proxy would be overkill.
- Already using a managed service. If you're calling OpenAI or Anthropic directly and have no local backends, a proxy is the wrong tool (you might want a gateway for key management, but that's different).
The inflection point is usually: more than one backend, or more than one person using the same inference infrastructure. That's when the friction of hardcoded backend URLs starts to cost you, and a proxy pays for itself in the first week.
Wrapping up#
An LLM proxy is not a complicated concept: it is a reverse proxy that understands inference traffic. One endpoint in front of many backends, with health checks, failover, model-aware routing, sticky sessions for KV cache efficiency, and observability.
If you're running local inference at any meaningful scale - multiple backends, a team using shared infrastructure, or anything you need to stay up reliably - a proxy belongs in your stack. If you're tinkering solo on a single machine, it doesn't.
And when one proxy grows into a fleet of inference nodes across an organisation, that's when a platform like FoundryOS takes over the orchestration. That's a good problem to have later, not a reason to over-build today. Start with the proxy, and let the rest of the stack earn its place as you grow.
Further reading#
- Olla on GitHub - open-source LLM proxy (Apache 2.0)
- Olla documentation - configuration, backend integrations, sticky sessions
- Ollama - local inference, macOS/Linux/Windows
- vLLM documentation - high-throughput inference server
- SGLang - fast LLM and VLM serving
- llama.cpp - CPU and GPU inference in C++
- LM Studio - desktop inference with GUI
- OpenAI API reference - the de facto API standard most backends implement