What we found when we benchmarked Olla

We kept getting asked the same few things about Olla: does the routing actually save anything, does a proxy in the path cost you latency, how heavy is it, and does failover hold up when a backend dies. So we ran the tests and wrote down what came back, including a head-to-head against LiteLLM.

A word on the setup. We ran this on a Windows dev workstation, not a tuned Linux server, because that is where a lot of people actually try this stuff first: a Ryzen 9 5950X with an RTX 4000 Ada, Olla v0.0.28, Go 1.26. For everything except the KV-cache test we put Olla in front of a zero-delay mock backend, so the numbers measure the proxy itself rather than how fast a GPU happens to be that day. The KV-cache test used two real Ollama nodes. Conditions are noted throughout, and where a number only holds under specific circumstances we say so.

Sticky sessions and the KV cache

Send every turn of a conversation back to the same backend and its KV cache stays warm, so each turn only has to process the new tokens. Round-robin sends every other turn to a backend that has not seen the conversation, which then re-reads the whole thing before it can answer. The longer the conversation, the more that cold re-read costs.
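To make the difference concrete, here is a minimal sketch of the two routing modes. It is not Olla's implementation: the `Router` type, the backend names, and the choice of hashing a session key with FNV are all assumptions for illustration, and it assumes a non-empty backend list.

```go
package main

import (
	"hash/fnv"
	"sync/atomic"
)

// Router is a hypothetical type for illustration only.
// It assumes backends is non-empty.
type Router struct {
	backends []string
	next     atomic.Uint64
}

// RoundRobin ignores the conversation and cycles through the backends,
// so consecutive turns of one conversation land on different nodes.
func (r *Router) RoundRobin() string {
	n := r.next.Add(1) - 1
	return r.backends[n%uint64(len(r.backends))]
}

// Sticky hashes the session key, so every turn of the same conversation
// maps to the same backend and can reuse that backend's warm KV cache.
func (r *Router) Sticky(sessionKey string) string {
	h := fnv.New32a()
	h.Write([]byte(sessionKey))
	return r.backends[h.Sum32()%uint32(len(r.backends))]
}
```

With two backends, repeated calls to `Sticky("conv-42")` always return the same node, while consecutive `RoundRobin()` calls alternate between the two. A plain hash like this reshuffles sessions whenever the backend list changes, which a real router would likely want to handle more carefully.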

We ran the same ten-turn conversation under both routing modes, at four context lengths, five conversations each. The clearest way to see it is turn two, the first turn where a warm cache can help, as context grows:

[Figure: turn-2 time to first token by context length. Round-robin climbs from 2.2 s at 2k tokens to 8.1 s at 32k, while sticky stays flat near 1 s.]

Round-robin's cold second turn climbs from 2.2 s at 2k tokens to 8.1 s at 32k. Sticky barely moves. At 32k that is 8,126 ms down to 1,072 ms, about 7.6x faster. At 16k it is 4,814 ms vs 1,694 ms, about 2.8x. The benefit tracks context length, which makes sense: the bigger the prefix, the more there is to avoid re-reading.
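The ratios quoted above are plain division of the round-robin figure by the sticky one. A small sketch of that arithmetic, with hypothetical helper names:

```go
package main

// speedup reports how many times faster the sticky (warm) turn is
// than the round-robin (cold) turn at the same context length.
func speedup(coldMs, warmMs float64) float64 {
	return coldMs / warmMs
}

// savedMs reports the time to first token saved on that turn, in milliseconds.
func savedMs(coldMs, warmMs float64) float64 {
	return coldMs - warmMs
}
```

Plugging in the measurements: `speedup(8126, 1072)` is about 7.58 with 7,054 ms saved at 32k, and `speedup(4814, 1694)` is about 2.84 with 3,120 ms saved at 16k.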

Does the proxy cost you anything

Against the zero-delay mock, with both paths hitting the same local backend, Olla added 0.30 ms at the median. The routing decision on its own is far smaller again: round-robin selection benchmarks at about 67 nanoseconds a call. Throughput through the proxy topped out around 10,200 requests a second at 32 concurrent connections, which is the routing layer's ceiling, not an inference number.
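For readers who want to reproduce an overhead figure like this, one way to compute it is to take the median latency through the proxy and subtract the median latency of the direct path. The sketch below assumes that definition (difference of medians, latencies in milliseconds); the function names are hypothetical and this is not the harness we used.

```go
package main

import "sort"

// median returns the middle value of samples, or the mean of the two
// middle values when the count is even. It does not modify its input.
func median(samples []float64) float64 {
	n := len(samples)
	if n == 0 {
		return 0
	}
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// proxyOverheadMs is the extra median latency, in milliseconds, of going
// through the proxy compared with hitting the same backend directly.
// Assumption: overhead is defined as a difference of medians.
func proxyOverheadMs(directMs, proxiedMs []float64) float64 {
	return median(proxiedMs) - median(directMs)
}
```

With direct samples of 0.10, 0.12 and 0.11 ms and proxied samples of 0.40, 0.42 and 0.41 ms, the overhead comes out at 0.30 ms, up to floating-point rounding.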

Memory, and how it compares to LiteLLM

Olla is a single Go binary and it behaves like one: about 20 MB resident at idle, rising to 42-57 MB under concurrent load depending on how many endpoints it is watching. The binary on disk is 14.4 MB and the Go heap stays under 3 MB at rest. (This corrects the "30-50 MB" we had on the site, which was too high for idle and only right at the top end under load.)
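Resident memory and Go heap are different measurements: resident memory is what the operating system reports for the whole process, while the heap figure comes from the Go runtime. As a rough sketch of the second one, a Go program can read its own heap usage with `runtime.ReadMemStats`. The helper name is hypothetical, and dividing by 1024 x 1024 is an assumption about what "MB" means here.

```go
package main

import "runtime"

// heapInUseMB returns the bytes in in-use heap spans as reported by the
// Go runtime, converted to MB (assumed here to mean 1024*1024 bytes).
// This is only the Go heap, not the process's resident memory.
func heapInUseMB() float64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.HeapInuse) / (1024 * 1024)
}
```

Resident memory has to come from the operating system (Task Manager on Windows, `ps` or `/proc` on Linux), which is why the two numbers differ.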

The comparison that makes it land is against LiteLLM, a popular Python proxy, on the same mock backend:

Metric            Olla (Go)    LiteLLM (Python)
Idle memory       20.2 MB      298.3 MB
Overhead (p50)    0.2 ms       63 ms

That is roughly 15x lighter at idle. To be fair to LiteLLM: it does a great deal Olla does not (spend tracking, caching, broad provider routing), so this is a footprint comparison, not a feature one. It mostly reflects Go's net/http against a Python async stack.

Failover

We pointed load at two backends and hard-killed one about twelve seconds into the run, three times over.

Across all three runs, 20,223 requests in total, not one was dropped. Olla saw the connection-refused error, took the dead backend out of rotation, and moved the in-flight and following requests onto the survivor before any of it reached a client. Median time to reroute everything was 186 ms; when we restarted the backend it was back in rotation in about 264 ms, faster than the 2-second health-check interval because the restart usually lands mid-cycle.

That zero holds under specific conditions: retry has to be on, the failure has to be a real connection error rather than an HTTP 500 from a backend that is still up, and there has to be another healthy backend to move to.
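Those three conditions map onto a simple retry loop. The sketch below is an illustration under stated assumptions, not Olla's code: the `Pool` type, the `send` callback, and the error values are hypothetical, and the pool is not safe for concurrent use as written.

```go
package main

import "errors"

// Hypothetical error values for illustration.
var (
	errConnRefused      = errors.New("connection refused")
	errNoHealthyBackend = errors.New("no healthy backend left")
)

// Pool is a hypothetical set of backends with a health flag each.
type Pool struct {
	backends []string
	healthy  map[string]bool
}

// Do tries each healthy backend in order. A transport error (for example
// connection refused) takes that backend out of rotation and retries on
// the next one. Any HTTP response, including a 500 from a backend that is
// still up, is returned to the caller as-is and is not retried.
func (p *Pool) Do(send func(backend string) (status int, err error)) (int, error) {
	for _, b := range p.backends {
		if !p.healthy[b] {
			continue
		}
		status, err := send(b)
		if err != nil {
			p.healthy[b] = false // dead backend leaves rotation
			continue             // retry on the next healthy backend
		}
		return status, nil
	}
	return 0, errNoHealthyBackend
}
```

If `send` fails with `errConnRefused` on the first backend and succeeds on the second, the caller sees only the successful response. If the first backend answers with a 500, that 500 goes straight back to the caller, and if every backend refuses connections the caller gets `errNoHealthyBackend`.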

Takeaways

Sticky routing keeps turn-two time to first token (TTFT) flat while round-robin climbs with context, up to 7.6x at 32k tokens. Worth it for long, warm, multi-turn conversations across several backends.
The proxy adds about 0.30 ms at the median and a routing decision is 67 ns. It is not what your users wait on.
Olla idles at about 20 MB from a 14.4 MB binary, against roughly 300 MB for LiteLLM.
A backend was killed mid-run three times and not one of 20,223 requests failed.
Measured on a single Windows dev box with conditions noted. The benchmarks page has the full method and tables.