Olla Performance Benchmarks
A proxy in front of production has to earn its place. Four questions decide whether it does: does smart routing save anything, does the proxy add latency, how heavy is it, and does failover hold up when a backend dies. We measured all four on a single Windows workstation, using mock backends wherever the proxy itself is what's under test, so the numbers reflect Olla and not the GPU. It's one box, not a tuned cluster, so every figure is scoped to the conditions we state beside it.
Test environment
- Olla: v0.0.28 (commit bd0c81e)
- Hardware: AMD Ryzen 9 5950X (16C/32T), 64 GB RAM
- GPU: NVIDIA RTX 4000 SFF Ada, 20 GB
- OS: Windows 11 Pro (build 26200)
- Runtime: Go 1.26, Ollama 0.30
- Date: 8 June 2026
1. Sticky sessions and the KV cache
Send every turn of a conversation back to the same backend and its KV cache stays warm, so each turn only processes the new tokens. Round-robin sends alternate turns to a backend that has not seen the conversation, which then re-reads the whole prefix before it can answer. That cold-turn penalty grows with context length; sticky routing removes it from turn two on.
| Context (tokens) | Round-robin (cold), turn-2 TTFT | Sticky (warm), turn-2 TTFT | Speedup |
|---|---|---|---|
| 2,048 | 2,177 ms | 912 ms | 2.4× |
| 8,192 | 3,446 ms | 1,628 ms | 2.1× |
| 16,384 | 4,814 ms | 1,694 ms | 2.8× |
| 32,768 | 8,126 ms | 1,072 ms | 7.6× |
Sticky routing kept all 20 conversations (5 per context length) pinned to one backend for all 10 turns. The cache hit rate was 90%; the 10% of misses are the turn-1 cold inserts, which no routing policy can avoid. The benefit needs a warm, long, multi-turn conversation across two or more backends - on short chats or a single backend there is little to gain. One backend here was a separate LAN node with a different GPU, so absolute times carry some hardware variance.
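The 90% figure follows directly from the turn count, and a toy model makes that visible. The sketch below is not Olla's router: it assumes a turn counts as warm only when the same backend served the previous turn, and the names `pick`, `sticky`, `roundRobin` and `warmRate` are hypothetical.

```go
package main

import "fmt"

// pick is a routing policy: given the 0-based turn index and the number of
// backends, it returns the index of the backend that serves the turn.
type pick func(turn, backends int) int

// roundRobin moves to the next backend on every turn.
func roundRobin(turn, backends int) int { return turn % backends }

// sticky always returns backend 0, standing in for a session-pinned route.
func sticky(turn, backends int) int { return 0 }

// warmRate returns the fraction of turns served warm. Simplifying assumption:
// a turn is warm only if the same backend also served the previous turn.
func warmRate(p pick, turns, backends int) float64 {
	warm, last := 0, -1
	for t := 0; t < turns; t++ {
		b := p(t, backends)
		if b == last {
			warm++
		}
		last = b
	}
	return float64(warm) / float64(turns)
}

func main() {
	fmt.Printf("sticky, 2 backends:      %.0f%%\n", 100*warmRate(sticky, 10, 2))
	fmt.Printf("round-robin, 2 backends: %.0f%%\n", 100*warmRate(roundRobin, 10, 2))
	fmt.Printf("round-robin, 1 backend:  %.0f%%\n", 100*warmRate(roundRobin, 10, 1))
}
```

Under this model a 10-turn sticky conversation is warm on 9 of 10 turns, strict alternation across two backends is never warm, and a single backend behaves like sticky routing, which is why the benefit only shows up with two or more backends.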
2. Proxy overhead and throughput
Measured against a zero-delay Go mock on localhost, so the difference is pure proxy cost - both paths hit the same local backend. At the median, Olla adds +0.30 ms at concurrency 1, rising to +1.70 ms at concurrency 64.
| Concurrency | Olla p50 | Direct p50 | Overhead | Olla req/s |
|---|---|---|---|---|
| 1 | 0.40 ms | 0.10 ms | +0.30 ms | 2,354 |
| 8 | 0.60 ms | 0.20 ms | +0.40 ms | 8,804 |
| 32 | 1.90 ms | 1.20 ms | +0.70 ms | 10,229 |
| 64 | 3.50 ms | 1.80 ms | +1.70 ms | 9,788 |
Throughput peaks at ~10,200 req/s at c=32 - the routing-layer ceiling, not inference throughput. The balancer decision itself is ~67 ns per call (two endpoints). The p99 (not tabulated here) climbs at high concurrency, which we attribute to the coarser Go scheduler quantum on Windows; a Linux deployment should see a flatter tail.
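For readers who want to reproduce the shape of this measurement, here is a minimal sketch at concurrency 1. It is not the harness used for the table: the standard library's `httputil.NewSingleHostReverseProxy` stands in for Olla, and `p50`, `overhead` and `measure` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"sort"
	"time"
)

// p50 returns the median of samples (the upper median for even counts).
// It expects at least one sample.
func p50(samples []float64) float64 {
	s := append([]float64(nil), samples...) // copy so the caller's slice is not reordered
	sort.Float64s(s)
	return s[len(s)/2]
}

// overhead is the median proxied latency minus the median direct latency.
func overhead(proxied, direct []float64) float64 { return p50(proxied) - p50(direct) }

// measure issues n sequential GETs and returns each latency in milliseconds.
func measure(target string, n int) ([]float64, error) {
	out := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		start := time.Now()
		resp, err := http.Get(target)
		if err != nil {
			return nil, err
		}
		_, _ = io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
		out = append(out, float64(time.Since(start))/float64(time.Millisecond))
	}
	return out, nil
}

func main() {
	// Zero-delay mock backend on localhost.
	mock := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer mock.Close()

	// Stand-in proxy; a real run would send this path through Olla instead.
	u, err := url.Parse(mock.URL)
	if err != nil {
		panic(err)
	}
	proxy := httptest.NewServer(httputil.NewSingleHostReverseProxy(u))
	defer proxy.Close()

	direct, err := measure(mock.URL, 500)
	if err != nil {
		panic(err)
	}
	proxied, err := measure(proxy.URL, 500)
	if err != nil {
		panic(err)
	}
	fmt.Printf("direct p50 %.2f ms, proxied p50 %.2f ms, overhead %+.2f ms\n",
		p50(direct), p50(proxied), overhead(proxied, direct))
}
```

Because both paths end at the same zero-delay handler, the subtraction isolates what the intermediate hop costs. The absolute numbers from this sketch will differ from the table, since the proxy under test is a different one.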
3. Memory footprint
Olla idles at ~20 MB and holds 42-57 MB under concurrent load. The binary is 14.4 MB; the Go heap stays under 3 MB at rest.
| Endpoints | Idle RSS | Load steady RSS |
|---|---|---|
| 1 | 19.8 MB | 41.6 MB |
| 4 | 20.0 MB | 43.9 MB |
| 8 | 21.9 MB | 51.0 MB |
| 16 | 22.5 MB | 56.6 MB |
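RSS and Go heap are different quantities: RSS is what the operating system charges the process, while the heap figure comes from the Go runtime itself. As a rough illustration of the second, the sketch below reads the runtime's own heap counter. How Olla's figures were actually sampled is not stated here, so treat `heapMiB` as a hypothetical helper and the method as an assumption.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapMiB reports the Go heap in use, in MiB (2^20 bytes). It forces a GC
// first so the figure is dominated by retained objects rather than garbage.
// Note: the tables above say "MB" without stating which convention they use.
func heapMiB() float64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.HeapAlloc) / (1 << 20)
}

func main() {
	fmt.Printf("Go heap in use: %.2f MiB\n", heapMiB())
}
```

The gap between a heap reading like this and the process RSS is taken up by the mapped binary, goroutine stacks and runtime bookkeeping, which is consistent with a 14.4 MB binary idling near 20 MB with a heap under 3 MB.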
Head-to-head with LiteLLM
| Metric | Olla (Go) | LiteLLM (Python) | Difference |
|---|---|---|---|
| Idle RSS | 20.2 MB | 298.3 MB | 14.8× lighter |
| Load steady RSS | 42.5 MB | 301.5 MB | 7.1× lighter |
| Overhead p50 | 0.2 ms | 63.0 ms | ~315× lower |
| Overhead p99 | 1.8 ms | 90.6 ms | ~50× lower |
Both proxies ran against the same zero-delay mock at c=16. This is a footprint comparison, not feature parity - LiteLLM ships spend tracking, caching and broad provider routing that Olla does not. The gap largely reflects Go net/http against a Python async stack.
4. Failover
0 requests dropped across 3 runs (20,223 total) with one of two backends hard-killed mid-load. Median detection 186 ms, recovery 264 ms.
Olla's retry reroutes a connection-refused error onto the surviving backend within the same in-flight request, so the kill is invisible to callers. Recovery beats the 2 s health-check interval because a restart usually lands mid-cycle.
Zero-drop holds when retry is enabled, the failure is a transport-level connection error (not an HTTP 5xx from a live backend), and at least one healthy backend remains.
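The conditions above can be sketched in a few lines. This is a simplified client-side stand-in, not Olla's retry code: `getWithFailover` and `run` are hypothetical names, and the only behaviour modelled is "retry on a transport error, pass through any HTTP response".

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// getWithFailover tries each backend in order. A transport-level error (for
// example connection refused) moves on to the next backend; any HTTP
// response, including a 5xx, is returned to the caller without a retry.
func getWithFailover(client *http.Client, backends []string, path string) (*http.Response, error) {
	lastErr := errors.New("no backends configured")
	for _, base := range backends {
		resp, err := client.Get(base + path)
		if err != nil {
			lastErr = err
			continue
		}
		return resp, nil
	}
	return nil, fmt.Errorf("all backends failed: %w", lastErr)
}

// run starts two mock backends: the first answers 503, the second 200.
// With killFirst set, the first is shut down before the request is sent.
func run(killFirst bool) (int, error) {
	first := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
	}))
	second := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer second.Close()
	if killFirst {
		first.Close() // hard kill: the port now refuses connections
	} else {
		defer first.Close()
	}

	resp, err := getWithFailover(http.DefaultClient, []string{first.URL, second.URL}, "/")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	killed, err := run(true)
	fmt.Println("first backend killed:", killed, err)
	alive, err := run(false)
	fmt.Println("first backend alive, answering 503:", alive, err)
}
```

With the first backend killed, the caller sees the 200 from the survivor. With the first backend alive but answering 503, the caller sees the 503, which is the case the zero-drop claim explicitly excludes.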
Citable claims
- Sticky routing cut turn-2 TTFT on a 32,768-token conversation from 8,126 ms (round-robin, cold backend) to 1,072 ms (sticky, warm cache) - 7.6× faster, median of 5 conversations. Requires a warm multi-turn conversation across two or more backends with sticky sessions enabled.
- At 16,384 tokens the same turn-2 figure was 4,814 ms vs 1,694 ms (2.8×). The cold-turn penalty under round-robin grows with context; sticky stays roughly flat at 0.9-1.7 s.
- Olla adds +0.30 ms p50 over a direct call at c=1, measured against a zero-delay localhost mock (routing, balancer selection, header injection, connection pooling).
- Proxy throughput peaks at ~10,200 req/s at c=32 against a zero-delay mock. This is the routing-layer ceiling, not inference throughput.
- Round-robin endpoint selection costs ~67 ns per call at two endpoints (one allocation); sub-microsecond and not the bottleneck.
- Idle RSS is ~20 MB at 1-4 endpoints, ~22.5 MB at 16; under concurrent load 42-57 MB. The Go heap stays under ~3 MB idle.
- Against the same mock, Olla idles at 20.2 MB vs LiteLLM 1.88.0 at 298.3 MB (14.8× lighter) with p50 overhead 0.2 ms vs 63 ms. Footprint comparison, not feature parity.
- 0 requests dropped across 3 failover runs (20,223 total) when one of two backends was hard-killed mid-load; median detection 186 ms, recovery 264 ms. Requires retry enabled, a transport-level failure, and a surviving healthy backend.
Methodology and limitations
- Tests 2-4 used Go net/http mock backends on localhost so the proxy, not a GPU, is what is measured. The sticky test used two real Ollama nodes (one local RTX 4000, one LAN node with a different GPU).
- Sticky pinning used the X-Olla-Session-ID header; the prefix-hash key source does not pin across turns because the payload grows each turn.
- Sample sizes: 5 conversations × 10 turns per context length; n=500 × 3 repeats per overhead cell; 20,223 requests across 3 failover runs; memory sampled over 60-90 s.
- Single Windows 11 workstation. Multi-GPU or true-cluster behaviour, sustained soak stability, and real streaming-inference throughput are not characterised here.
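The key-source note in the list above can be illustrated with a toy comparison. The sketch assumes a payload-derived key is a hash over the whole request body, which is a guess at the general idea and not Olla's actual prefix-hash implementation; `payloadKey`, `sessionKey`, `distinctKeys` and the ID `conv-42` are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// payloadKey derives a routing key by hashing the full request payload.
// Assumed stand-in for a payload-derived key source, not Olla's code.
func payloadKey(messages []string) string {
	sum := sha256.Sum256([]byte(strings.Join(messages, "\n")))
	return hex.EncodeToString(sum[:8])
}

// sessionKey uses the caller-supplied session ID unchanged, as a client
// would send it in the X-Olla-Session-ID header on every turn.
func sessionKey(sessionID string) string { return sessionID }

// distinctKeys replays one conversation that grows by a message per turn and
// counts how many different routing keys each key source produces.
func distinctKeys(turns int) (payloadKeys, sessionKeys int) {
	payloadSeen := map[string]bool{}
	sessionSeen := map[string]bool{}
	var messages []string
	for t := 1; t <= turns; t++ {
		messages = append(messages, fmt.Sprintf("turn %d", t))
		payloadSeen[payloadKey(messages)] = true
		sessionSeen[sessionKey("conv-42")] = true
	}
	return len(payloadSeen), len(sessionSeen)
}

func main() {
	p, s := distinctKeys(10)
	fmt.Printf("payload-derived keys: %d, session-ID keys: %d\n", p, s)
}
```

A key that changes every turn gives the router nothing stable to pin on, whereas a constant session ID maps every turn of the conversation to the same key.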