Olla Performance Benchmarks

A proxy in front of production has to earn its place. Four questions decide whether it does: does smart routing save anything, does the proxy add latency, how heavy is it, and does failover hold up when a backend dies. We measured all four from a single Windows workstation, using mock backends wherever the proxy itself is what's under test, so the numbers reflect Olla and not the GPU; only the sticky-session test used real Ollama nodes, one of them on the LAN. It's one box, not a tuned cluster, so every figure is scoped to the conditions we state beside it.

7.6×

faster cold-turn TTFT at 32k (sticky vs round-robin)

<1 ms

proxy overhead at the median (~0.30 ms p50)

14.8×

lighter at idle than the LiteLLM Python proxy

0

requests dropped across 20,223 during a backend kill

Test environment

Olla: v0.0.28 (commit bd0c81e)
Hardware: AMD Ryzen 9 5950X (16C/32T), 64 GB RAM
GPU: NVIDIA RTX 4000 SFF Ada, 20 GB
OS: Windows 11 Pro (build 26200)
Runtime: Go 1.26, Ollama 0.30
Date: 8 June 2026

1. Sticky sessions and the KV cache

Send every turn of a conversation back to the same backend and its KV cache stays warm, so each turn only processes the new tokens. Round-robin sends alternate turns to a backend that has not seen the conversation, which then re-reads the whole prefix before it can answer. That cold-turn penalty grows with context length; sticky routing removes it from turn two on.

Figure: turn-2 time to first token (TTFT) versus context length. On the first warm turn (turn 2), round-robin pays a cold re-read that scales with context, climbing from 2.2 s at 2k tokens to 8.1 s at 32k, while sticky stays roughly flat at 0.9-1.7 s. Median of 5 conversations per context length.

Context	Round-robin (cold)	Sticky (warm)	Speedup
2,048	2,177 ms	912 ms	2.4×
8,192	3,446 ms	1,628 ms	2.1×
16,384	4,814 ms	1,694 ms	2.8×
32,768	8,126 ms	1,072 ms	7.6×

Sticky routing kept each of the 20 conversations on one backend for all 10 turns, a 90% hit rate (the 10% of misses are the turn-1 cold inserts). The benefit needs a warm, long, multi-turn conversation across two or more backends - on short chats or a single backend there is little to gain. One backend here was a separate LAN node with a different GPU, so absolute times carry some hardware variance.
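The routing difference described in this section can be sketched in a few lines of Go. This is an illustrative sketch only, not Olla's implementation: the `roundRobin` and `stickyRouter` types, their fields, and the backend and session names are all hypothetical. It shows why the first turn of a conversation is always a miss and every later turn is a hit.

```go
package main

import (
	"fmt"
	"sync"
)

// roundRobin hands out backends in rotation, ignoring which conversation is asking.
// (Hypothetical type for illustration.)
type roundRobin struct {
	mu       sync.Mutex
	backends []string
	next     int
}

func (r *roundRobin) pick() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	b := r.backends[r.next%len(r.backends)]
	r.next++
	return b
}

// stickyRouter remembers the backend chosen for each session ID, so later
// turns of the same conversation return to the backend holding its KV cache.
// (Hypothetical type for illustration.)
type stickyRouter struct {
	mu       sync.Mutex
	fallback *roundRobin
	pinned   map[string]string
}

func newStickyRouter(backends []string) *stickyRouter {
	return &stickyRouter{
		fallback: &roundRobin{backends: backends},
		pinned:   map[string]string{},
	}
}

// pick returns the backend for sessionID and whether it was already pinned (a hit).
func (s *stickyRouter) pick(sessionID string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if b, ok := s.pinned[sessionID]; ok {
		return b, true
	}
	// First turn: no pin yet, so fall back to rotation and remember the choice.
	b := s.fallback.pick()
	s.pinned[sessionID] = b
	return b, false
}

func main() {
	s := newStickyRouter([]string{"backend-a", "backend-b"})
	const turns = 10
	hits := 0
	for i := 0; i < turns; i++ {
		if _, hit := s.pick("conversation-1"); hit {
			hits++
		}
	}
	fmt.Printf("hit rate: %d%%\n", hits*100/turns) // prints "hit rate: 90%"
}
```

With 10 turns per conversation, one cold insert followed by nine hits gives the same 90% hit rate reported above.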

2. Proxy overhead and throughput

Measured against a zero-delay Go mock on localhost, so the difference is pure proxy cost - both paths hit the same local backend. At the median, Olla adds +0.30 ms at c=1, rising to +1.70 ms at c=64.

Concurrency	Olla p50	Direct p50	Overhead	Olla req/s
1	0.40 ms	0.10 ms	+0.30 ms	2,354
8	0.60 ms	0.20 ms	+0.40 ms	8,804
32	1.90 ms	1.20 ms	+0.70 ms	10,229
64	3.50 ms	1.80 ms	+1.70 ms	9,788

Throughput peaks at ~10,200 req/s at c=32 - the routing-layer ceiling, not inference throughput. The balancer decision itself is ~67 ns per call (two endpoints). The p99 climbs at high concurrency, which reflects the coarser Go scheduler quantum on Windows; a Linux deployment should see a flatter tail.

3. Memory footprint

Olla idles at ~20 MB and holds 42-57 MB under concurrent load. The binary is 14.4 MB; the Go heap stays under 3 MB at rest.

Endpoints	Idle RSS	Load steady RSS
1	19.8 MB	41.6 MB
4	20.0 MB	43.9 MB
8	21.9 MB	51.0 MB
16	22.5 MB	56.6 MB

Head-to-head with LiteLLM

Metric	Olla (Go)	LiteLLM (Python)	Difference
Idle RSS	20.2 MB	298.3 MB	14.8× lighter
Load steady RSS	42.5 MB	301.5 MB	7.1× lighter
Overhead p50	0.2 ms	63.0 ms	~315× lower
Overhead p99	1.8 ms	90.6 ms	~50× lower

Both proxies ran against the same zero-delay mock at c=16. This is a footprint comparison, not feature parity - LiteLLM ships spend tracking, caching and broad provider routing that Olla does not. The gap reflects Go net/http against a Python async stack.

4. Failover

0 requests dropped across 3 runs (20,223 total) with a backend hard-killed mid-load. Median detection 186 ms, recovery 264 ms.

Olla's retry reroutes a connection-refused error onto the surviving backend within the same in-flight request, so the kill is invisible to callers. Recovery beats the 2 s health-check interval because a restart usually lands mid-cycle.

Zero-drop holds when retry is enabled, the failure is a transport-level connection error (not an HTTP 5xx from a live backend), and at least one healthy backend remains.

Citable claims

Sticky routing cut turn-2 TTFT on a 32,768-token conversation from 8,126 ms (round-robin, cold backend) to 1,072 ms (sticky, warm cache) - 7.6× faster, median of 5 conversations. Requires a warm multi-turn conversation across two or more backends with sticky sessions enabled.
At 16,384 tokens the same turn-2 figure was 4,814 ms vs 1,694 ms (2.8×). The cold-turn penalty under round-robin grows with context; sticky stays flat at roughly 1-1.7 s.
Olla adds +0.30 ms p50 over a direct call at c=1, measured against a zero-delay localhost mock (routing, balancer selection, header injection, connection pooling).
Proxy throughput peaks at ~10,200 req/s at c=32 against a zero-delay mock. This is the routing-layer ceiling, not inference throughput.
Round-robin endpoint selection costs ~67 ns per call at two endpoints (one allocation); sub-microsecond and not the bottleneck.
Idle RSS is ~20 MB at 1-4 endpoints, ~22.5 MB at 16; under concurrent load 42-57 MB. The Go heap stays under ~3 MB idle.
Against the same mock, Olla idles at 20.2 MB vs LiteLLM 1.88.0 at 298.3 MB (14.8× lighter) with p50 overhead 0.2 ms vs 63 ms. Footprint comparison, not feature parity.
0 requests dropped across 3 failover runs (20,223 total) when one of two backends was hard-killed mid-load; median detection 186 ms, recovery 264 ms. Requires retry enabled, a transport-level failure, and a surviving healthy backend.

Methodology and limitations

Tests 2-4 used Go net/http mock backends on localhost so the proxy, not a GPU, is what is measured. The sticky test used two real Ollama nodes (one local RTX 4000, one LAN node with a different GPU).
Sticky pinning used the X-Olla-Session-ID header; the prefix-hash key source does not pin across turns because the payload grows each turn.
Sample sizes: 5 conversations × 10 turns per context length; n=500 × 3 repeats per overhead cell; 20,223 requests across 3 failover runs; memory sampled over 60-90 s.
Single Windows 11 workstation. Multi-GPU or true-cluster behaviour, sustained soak stability, and real streaming-inference throughput are not characterised here.
