
Olla Performance Benchmarks

A proxy in front of production has to earn its place. Four questions decide whether it does: does smart routing save anything, does the proxy add latency, how heavy is it, and does failover hold up when a backend dies. We measured all four on a single Windows workstation, using mock backends wherever the proxy itself is what's under test, so the numbers reflect Olla and not the GPU. It's one box, not a tuned cluster, so every figure is scoped to the conditions we state beside it.

  • 7.6× faster cold-turn TTFT at 32k (sticky vs round-robin)
  • <1 ms proxy overhead at the median (~0.30 ms p50)
  • 14.8× lighter at idle than the LiteLLM Python proxy
  • 0 requests dropped across 20,223 during a backend kill

Test environment

Olla: v0.0.28 (commit bd0c81e)
Hardware: AMD Ryzen 9 5950X (16C/32T), 64 GB RAM
GPU: NVIDIA RTX 4000 SFF Ada, 20 GB
OS: Windows 11 Pro (build 26200)
Runtime: Go 1.26, Ollama 0.30
Date: 8 June 2026

1. Sticky sessions and the KV cache

Send every turn of a conversation back to the same backend and its KV cache stays warm, so each turn only processes the new tokens. Round-robin sends alternate turns to a backend that has not seen the conversation, which then re-reads the whole prefix before it can answer. That cold-turn penalty grows with context length; sticky routing removes it from turn two on.

Figure: turn-2 time to first token versus context length. Round-robin climbs from 2.2 s at 2k tokens to 8.1 s at 32k, while sticky stays between roughly 0.9 s and 1.7 s.
First warm turn (turn 2): round-robin pays a cold re-read that scales with context; sticky holds flat. Median of 5 conversations per context length.
Context (tokens) | Round-robin (cold) | Sticky (warm) | Speedup
2,048            | 2,177 ms           | 912 ms        | 2.4×
8,192            | 3,446 ms           | 1,628 ms      | 2.1×
16,384           | 4,814 ms           | 1,694 ms      | 2.8×
32,768           | 8,126 ms           | 1,072 ms      | 7.6×

Sticky routing kept each of the 20 conversations on a single backend for all 10 turns. The hit rate was 90%; the 10% of misses are the turn-1 cold inserts, one per conversation. The benefit needs a warm, long, multi-turn conversation across two or more backends - on short chats or a single backend there is little to gain. One backend here was a separate LAN node with a different GPU, so absolute times carry some hardware variance.
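The pinning behaviour and the 90% hit rate can be sketched in a few lines of Go. This is a minimal illustration, not Olla's internal implementation: `StickyRouter`, `Pick` and `WarmHits` are hypothetical names, the backend names are placeholders, and the map is not safe for concurrent use.

```go
package main

import "fmt"

// StickyRouter pins each session ID to the backend that served its first turn.
// Hypothetical type for illustration; not Olla's actual code.
type StickyRouter struct {
	backends []string
	next     int               // round-robin cursor, used only for new sessions
	pinned   map[string]string // session ID -> backend
}

func NewStickyRouter(backends []string) *StickyRouter {
	return &StickyRouter{backends: backends, pinned: make(map[string]string)}
}

// Pick returns the backend for a session and whether the session was already
// pinned (a warm hit). New sessions are assigned round-robin, then pinned.
func (r *StickyRouter) Pick(sessionID string) (backend string, warm bool) {
	if b, ok := r.pinned[sessionID]; ok {
		return b, true
	}
	b := r.backends[r.next%len(r.backends)]
	r.next++
	r.pinned[sessionID] = b
	return b, false
}

// WarmHits routes the given number of turns for one conversation and counts
// how many of them landed on an already-pinned backend.
func WarmHits(r *StickyRouter, sessionID string, turns int) int {
	hits := 0
	for i := 0; i < turns; i++ {
		if _, warm := r.Pick(sessionID); warm {
			hits++
		}
	}
	return hits
}

func main() {
	r := NewStickyRouter([]string{"backend-a", "backend-b"})
	hits := WarmHits(r, "conv-1", 10)
	// Turn 1 is the only cold insert, so 9 of 10 turns are warm.
	fmt.Printf("warm hits: %d of 10\n", hits)
}
```

With 10 turns per conversation, the first turn is always a miss and the remaining nine are hits, which is where a 90% hit rate comes from.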

2. Proxy overhead and throughput

Both paths hit the same zero-delay Go mock on localhost, so the difference between them is pure proxy cost. At c=1, Olla adds +0.30 ms at the median; the overhead stays under 1 ms up to c=32.

Concurrency | Olla p50 | Direct p50 | Overhead | Olla req/s
1           | 0.40 ms  | 0.10 ms    | +0.30 ms | 2,354
8           | 0.60 ms  | 0.20 ms    | +0.40 ms | 8,804
32          | 1.90 ms  | 1.20 ms    | +0.70 ms | 10,229
64          | 3.50 ms  | 1.80 ms    | +1.70 ms | 9,788

Throughput peaks at ~10,200 req/s at c=32 - the routing-layer ceiling, not inference throughput. The balancer decision itself is ~67 ns per call (two endpoints). The p99 climbs at high concurrency, which we attribute to the coarser Go scheduler quantum on Windows; a Linux deployment should see a flatter tail.
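The overhead column is a difference of medians. The sketch below shows one way to compute it, assuming a nearest-rank percentile; the benchmark harness may use a different percentile definition. `Percentile`, `OverheadP50` and `micros` are hypothetical helper names, and the sample values are illustrative, not measured data.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// Percentile returns the nearest-rank p-th percentile (0 < p <= 100).
// The input is copied so the caller's slice keeps its order.
func Percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// OverheadP50 is the median latency through the proxy minus the median
// latency of direct calls to the same backend.
func OverheadP50(proxied, direct []time.Duration) time.Duration {
	return Percentile(proxied, 50) - Percentile(direct, 50)
}

// micros builds a sample slice from microsecond values.
func micros(vals ...int) []time.Duration {
	out := make([]time.Duration, len(vals))
	for i, v := range vals {
		out[i] = time.Duration(v) * time.Microsecond
	}
	return out
}

func main() {
	// Illustrative samples only; a real run would collect hundreds per cell.
	proxied := micros(400, 500, 300)
	direct := micros(100, 200, 100)
	fmt.Println("p50 overhead:", OverheadP50(proxied, direct))
}
```

Subtracting medians rather than means keeps a handful of slow outliers, such as the Windows tail noted above, from inflating the headline figure.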

3. Memory footprint

Olla idles at ~20 MB and holds 42-57 MB under concurrent load. The binary is 14.4 MB; the Go heap stays under 3 MB at rest.

Endpoints | Idle RSS | Load steady RSS
1         | 19.8 MB  | 41.6 MB
4         | 20.0 MB  | 43.9 MB
8         | 21.9 MB  | 51.0 MB
16        | 22.5 MB  | 56.6 MB
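RSS is an operating-system figure and has to be sampled from outside the process, but the Go heap number can be read from inside any Go program. The sketch below shows how a heap figure like the one above could be obtained with `runtime.ReadMemStats`; whether the benchmark collected it this way is an assumption, and `HeapAllocMB` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime"
)

// HeapAllocMB reports the bytes of allocated heap objects, in MB, as seen by
// the Go runtime. This is smaller than RSS, which also counts the binary,
// goroutine stacks and runtime overhead.
func HeapAllocMB() float64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.HeapAlloc) / (1024 * 1024)
}

func main() {
	fmt.Printf("heap in use: %.2f MB\n", HeapAllocMB())
}
```

The gap between the heap figure and idle RSS is expected: the 14.4 MB binary alone accounts for most of the ~20 MB resident at rest.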

Head-to-head with LiteLLM

Metric          | Olla (Go) | LiteLLM (Python) | Difference
Idle RSS        | 20.2 MB   | 298.3 MB         | 14.8× lighter
Load steady RSS | 42.5 MB   | 301.5 MB         | 7.1× lighter
Overhead p50    | 0.2 ms    | 63.0 ms          | ~315× lower
Overhead p99    | 1.8 ms    | 90.6 ms          | ~50× lower

Both proxies ran against the same zero-delay mock at c=16. This is a footprint comparison, not feature parity - LiteLLM ships spend tracking, caching and broad provider routing that Olla does not. The gap largely reflects Go net/http against a Python async stack.

4. Failover

0 requests dropped across 3 runs (20,223 total) with a backend hard-killed mid-load. Median detection 186 ms, recovery 264 ms.

Olla's retry reroutes a connection-refused error onto the surviving backend within the same in-flight request, so the kill is invisible to callers. Recovery beats the 2 s health-check interval because a restart usually lands mid-cycle.

Zero-drop holds when retry is enabled, the failure is a transport-level connection error (not an HTTP 5xx from a live backend), and at least one healthy backend remains.
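The three conditions above map onto a small retry loop. The following is a sketch under stated assumptions, not Olla's retry code: `DoWithFailover`, `sendFunc` and `errConnRefused` are hypothetical names, and any non-nil error from `send` is treated as a transport-level failure.

```go
package main

import (
	"errors"
	"fmt"
)

// errConnRefused stands in for a transport-level connection error.
var errConnRefused = errors.New("connection refused")

// sendFunc performs one attempt against a backend. A non-nil error means the
// request never got an HTTP response; status is set when the backend answered.
type sendFunc func(backend string) (status int, err error)

// DoWithFailover tries each backend in order within a single request. It moves
// on only after a transport-level failure; any HTTP response, including a 5xx
// from a live backend, is returned to the caller unchanged.
func DoWithFailover(backends []string, send sendFunc) (string, int, error) {
	lastErr := errors.New("no backends configured")
	for _, b := range backends {
		status, err := send(b)
		if err == nil {
			return b, status, nil
		}
		lastErr = fmt.Errorf("%s: %w", b, err)
	}
	return "", 0, lastErr
}

func main() {
	// backend-a has been hard-killed; backend-b is still healthy.
	send := func(backend string) (int, error) {
		if backend == "backend-a" {
			return 0, errConnRefused
		}
		return 200, nil
	}
	served, status, err := DoWithFailover([]string{"backend-a", "backend-b"}, send)
	fmt.Println(served, status, err)
}
```

The caller sees a single successful response from the surviving backend. If every backend refuses the connection, the last error is returned and the request is dropped, which is why a healthy survivor is a precondition.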

Citable claims

  • Sticky routing cut turn-2 TTFT on a 32,768-token conversation from 8,126 ms (round-robin, cold backend) to 1,072 ms (sticky, warm cache) - 7.6× faster, median of 5 conversations. Requires a warm multi-turn conversation across two or more backends with sticky sessions enabled.
  • At 16,384 tokens the same turn-2 figure was 4,814 ms vs 1,694 ms (2.8×). The cold-turn penalty under round-robin grows with context; sticky stays flat at roughly 1-1.7 s.
  • Olla adds +0.30 ms p50 over a direct call at c=1, measured against a zero-delay localhost mock (routing, balancer selection, header injection, connection pooling).
  • Proxy throughput peaks at ~10,200 req/s at c=32 against a zero-delay mock. This is the routing-layer ceiling, not inference throughput.
  • Round-robin endpoint selection costs ~67 ns per call at two endpoints (one allocation); sub-microsecond and not the bottleneck.
  • Idle RSS is ~20 MB at 1-4 endpoints, ~22.5 MB at 16; under concurrent load 42-57 MB. The Go heap stays under ~3 MB idle.
  • Against the same mock, Olla idles at 20.2 MB vs LiteLLM 1.88.0 at 298.3 MB (14.8× lighter) with p50 overhead 0.2 ms vs 63 ms. Footprint comparison, not feature parity.
  • 0 requests dropped across 3 failover runs (20,223 total) when one of two backends was hard-killed mid-load; median detection 186 ms, recovery 264 ms. Requires retry enabled, a transport-level failure, and a surviving healthy backend.

Methodology and limitations

  • Tests 2-4 used Go net/http mock backends on localhost so the proxy, not a GPU, is what is measured. The sticky test used two real Ollama nodes (one local RTX 4000, one LAN node with a different GPU).
  • Sticky pinning used the X-Olla-Session-ID header; the prefix-hash key source does not pin across turns because the payload grows each turn.
  • Sample sizes: 5 conversations × 10 turns per context length; n=500 × 3 repeats per overhead cell; 20,223 requests across 3 failover runs; memory sampled over 60-90 s.
  • Single Windows 11 workstation. Multi-GPU or true-cluster behaviour, sustained soak stability, and real streaming-inference throughput are not characterised here.
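The note on key sources can be illustrated with a short Go sketch. It assumes a content-derived key computed as a hash over the whole request body; Olla's actual prefix-hash scheme is not described here, so `payloadKey` is a stand-in, and `headerKey` and `newHeader` are hypothetical helpers. Only the X-Olla-Session-ID header name comes from the text above.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/http"
)

const sessionHeader = "X-Olla-Session-ID"

// payloadKey is a stand-in content-derived key: an FNV-1a hash over the
// request body. Assumption for illustration, not Olla's prefix-hash scheme.
func payloadKey(body string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(body))
	return h.Sum64()
}

// headerKey returns the client-supplied session ID, which the client keeps
// constant for the whole conversation.
func headerKey(h http.Header) string {
	return h.Get(sessionHeader)
}

// newHeader builds request headers carrying the given session ID.
func newHeader(sessionID string) http.Header {
	h := http.Header{}
	h.Set(sessionHeader, sessionID)
	return h
}

func main() {
	turn1 := `{"messages":["What is a KV cache?"]}`
	turn2 := `{"messages":["What is a KV cache?","It stores attention state.","Why does it go cold?"]}`

	// The body grows every turn, so a key derived from it changes every turn.
	fmt.Println("payload key stable:", payloadKey(turn1) == payloadKey(turn2))

	// The header value is the same on every turn, so the key is stable.
	h := newHeader("conv-1")
	fmt.Println("header key stable:", headerKey(h) == headerKey(h))
}
```

A key that changes on every turn gives the router nothing to pin on, which is why the sticky test relied on the header instead.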