Stable β€’ Open Source β€’ Apache 2.0

Olla

The open-source AI inference proxy built for small businesses, development teams and growing organisations. Unified interface for Ollama, LM Studio, vLLM, SGLang and llama.cpp with intelligent routing, automatic failover and production-grade reliability. Deploy in minutes with Docker.

16K+
Container Pulls
Growing community adoption
100+
GitHub Stars
Active open source project
Apache 2.0
Open Source
Free forever, enterprise-friendly

Quick Start

Get up and running with Olla in minutes using containers.

$ docker run -t --name olla -p 40114:40114 ghcr.io/thushan/olla:latest
Unable to find image 'ghcr.io/thushan/olla:latest' locally
latest: Pulling from thushan/olla
a280336c5efa: Pull complete
c439f8aafbda: Download complete
4f4fb700ef54: Pull complete
28ca2a5cf8b5: Download complete
8066e87b2613: Pull complete
Digest: sha256:e03b64577b3794d3b31d76640dc93d2bfaf184796915fcbbca37dc4fcff140f7
Status: Downloaded newer image for ghcr.io/thushan/olla:latest
$ curl http://localhost:40114/olla/models
{"object":"list","data":[{"olla":{"family":"","variant":"","parameter_size":"","quantization":"","aliases":["ai/gpt-oss"],"availability":[{"endpoint":"local-docker","state":"unknown"}],"capabilities":["text-generation"]},"id":"ai/gpt-oss","object":"model","owned_by":"olla","created":1762542266}]}
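The model list follows the OpenAI list format, with an extra `olla` object per model carrying metadata and availability. A minimal sketch (Python, stdlib only) of how a client might pick out each model's id and the endpoints serving it, using the sample response above:

```python
import json

# Sample response from `GET /olla/models` (taken from the Quick Start above).
response = '''{"object":"list","data":[{"olla":{"family":"","variant":"","parameter_size":"","quantization":"","aliases":["ai/gpt-oss"],"availability":[{"endpoint":"local-docker","state":"unknown"}],"capabilities":["text-generation"]},"id":"ai/gpt-oss","object":"model","owned_by":"olla","created":1762542266}]}'''

models = json.loads(response)["data"]
for model in models:
    # Each model lists the endpoints it was discovered on.
    endpoints = [a["endpoint"] for a in model["olla"]["availability"]]
    print(f'{model["id"]}: available on {", ".join(endpoints)}')
# β†’ ai/gpt-oss: available on local-docker
```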

Product Overview

Built as a dedicated LLM proxy, Olla combines intelligent routing, failover, health monitoring and observability in a single binary, giving you performant, efficient management of your AI infrastructure.

Unified Model Registry

Discover and access models across ten inference backends through a single unified API. Olla automatically aggregates models from Ollama, LM Studio, vLLM, SGLang and other OpenAI-compatible endpoints with seamless format translation, so you can query any backend using your preferred API format.
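An illustrative sketch of what registering mixed backends in one config can look like (the key names here are assumptions for illustration, not Olla's verified schema; check the Olla documentation for the exact layout):

```yaml
# Illustrative only: key names are assumptions, not Olla's verified schema.
discovery:
  static:
    endpoints:
      - name: local-ollama
        url: http://localhost:11434
        type: ollama
      - name: studio-box
        url: http://192.168.1.20:1234
        type: lm-studio
      - name: gpu-server
        url: http://10.0.0.5:8000
        type: vllm
```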

Model Discovery

Intelligent Load Balancing

Distribute requests efficiently across multiple backends with priority-based, round-robin, least-connections and weighted strategies. Choose the algorithm that suits your workload: priority-based for tiered infrastructure, round-robin for uniform distribution, or least-connections for latency-sensitive applications. Automatic failover and recovery ensure high availability and minimal downtime.
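Switching strategies is typically a one-line config change, with priorities set per endpoint for tiered setups. A hypothetical sketch (key names are illustrative assumptions, not Olla's verified schema):

```yaml
# Illustrative only: consult the Olla docs for the exact configuration schema.
proxy:
  load_balancer: priority      # or: round-robin, least-connections
discovery:
  static:
    endpoints:
      - name: gpu-server
        url: http://10.0.0.5:8000
        priority: 100          # preferred tier
      - name: workstation
        url: http://192.168.1.20:11434
        priority: 50           # fallback tier
```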

Load Balancing

High Performance

Built on a hexagonal architecture with clean separation between core logic and adapters. Features dual proxy engines: Sherpa for lightweight traffic handling and Olla for full-featured proxying. Per-endpoint connection pooling, aggressive object pooling to minimise GC pressure and lock-free statistics deliver sub-millisecond overhead even under heavy load.
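Selecting between the two engines would typically be a single config setting; a hypothetical sketch (the key name is an assumption, not Olla's verified schema):

```yaml
# Illustrative only: key names are assumptions.
proxy:
  engine: sherpa   # lightweight engine; "olla" for full-featured proxying
```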

Performance

Health Monitoring

Circuit breakers with configurable thresholds automatically detect and route around unhealthy backends. Exponential backoff prevents cascading failures whilst gradual recovery reintroduces healed endpoints. Real-time metrics provide visibility into failure rates, recovery times and overall infrastructure health.
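The thresholds and backoff behaviour described above are the kind of thing you would tune per endpoint. A hypothetical config sketch (key names are illustrative assumptions, not Olla's verified schema):

```yaml
# Illustrative only: key names are assumptions, not Olla's verified schema.
endpoints:
  - name: local-ollama
    url: http://localhost:11434
    health_check:
      interval: 5s             # probe cadence
      timeout: 2s
    circuit_breaker:
      failure_threshold: 3     # open the breaker after 3 consecutive failures
      backoff: exponential     # wait longer between recovery probes
```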

Health Monitoring

Security & Rate Limiting

Built-in rate limiting and request validation to protect your infrastructure. Comprehensive audit logging tracks all requests.
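As a rough illustration of what global and per-client limits might look like in config (key names are assumptions for illustration, not Olla's verified schema):

```yaml
# Illustrative only: key names are assumptions, not Olla's verified schema.
server:
  rate_limits:
    global_requests_per_minute: 1000   # cap across all clients
    per_ip_requests_per_minute: 100    # cap per client address
  request_limits:
    max_body_size: 50MB                # reject oversized payloads early
```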

Security

Observability

Comprehensive metrics and structured logging provide deep insights into your LLM infrastructure performance and behaviour.

Observability

Blazing Fast Performance

Engineered for speed with minimal resource usage. Olla delivers enterprise-grade performance from a single binary that runs anywhere.

Structured Logs

Real-time visibility into Olla's operations with compact, comprehensive structured logging. Monitor startup, endpoint health checks and request processing with millisecond precision.

olla
╔─────────────────────────────────────────────────────────╗
β”‚                                      β €β €β£€β£€β €β €β €β €β €β£€β£€β €β €    β”‚
β”‚                                      ⠀⒰⑏Ⓓ⑆⠀⠀⠀⒰⑏Ⓓ⑆⑀   β”‚
β”‚   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•—     β–ˆβ–ˆβ•—      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—    β €β’Έβ‘‡β£Έβ‘·β Ÿβ ›β »β’Ύβ£‡β£Έβ‘‡    β”‚
β”‚  β–ˆβ–ˆβ•”β•β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•—   β’ β‘Ύβ ›β ‰β β €β €β €β ˆβ ‰β ›β’·β‘„   β”‚
β”‚  β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•‘   ⣿⠀⒀⣄⒀⣠⣀⣄⑀⣠⑀⠀⣿   β”‚
β”‚  β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•‘   β’»β£„β ˜β ‹β‘žβ ‰β’€β ‰β’³β ™β ƒβ’ β‘Ώβ‘€ β”‚
β”‚  β•šβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘  β–ˆβ–ˆβ•‘   β£Όβ ƒβ €β €β ³β €β ¬β €β žβ €β €β ˜β£·   β”‚
β”‚                                      β’Έβ‘Ÿβ €β €β €β €β €β €β €β €β €β’Έβ‘‡    β”‚
β”‚  github.com/thushan/olla   v0.0.21   β’Έβ‘…β €β €β €β €β €β €β €β €β €β’€β‘Ώ    β”‚
β•šβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•

Supported Backends

Olla provides integrations for a variety of backends. The ik_llama.cpp fork is supported via the llama.cpp backend.

See Integrations
Ollama, LM Studio, vLLM, LiteLLM, OpenAI-like, Lemonade, SGLang, llama.cpp, vLLM-MLX, Docker Model Runner

Frequently Asked Questions

Who is Olla designed for?

Olla is perfect for small businesses, development teams, startups and growing organisations that need reliable, production-grade AI routing without enterprise complexity. It is ideal for teams getting started with AI infrastructure or running small to medium-scale deployments. For enterprise-scale operations, consider FoundryOS, launching in Q2 2026.

What is Olla?

Olla is a high-performance, open-source LLM proxy and load balancer that intelligently routes requests across multiple inference backends including Ollama, LM Studio, vLLM and SGLang. It provides unified model discovery, automatic failover and comprehensive observability for AI infrastructure.

How do I install Olla?

Olla can be installed using Docker or Podman with a single command: docker run -t --name olla -p 40114:40114 ghcr.io/thushan/olla:latest. This pulls the latest image and starts the proxy on port 40114 (a mnemonic for "4 OLLA"). See our Quickstart Guide.
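For a persistent deployment, the same one-liner translates naturally into a Compose file (a minimal sketch using the image and port from above; adjust volumes and config mounts to suit your setup):

```yaml
# compose.yaml: the Compose equivalent of the docker run one-liner above.
services:
  olla:
    image: ghcr.io/thushan/olla:latest
    container_name: olla
    ports:
      - "40114:40114"
    restart: unless-stopped
```

Start it with `docker compose up -d` (or `podman compose up -d`).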

Is Olla free and open source?

Yes, Olla is completely free and open source software under the Apache License, Version 2.0. The source code is available on GitHub.

What license is Olla distributed as?

Olla is distributed under the Apache License, Version 2.0. Learn about Olla Licensing.

What inference backends does Olla support?

Olla supports 9+ inference backends including Ollama, LM Studio, vLLM, vLLM-MLX, SGLang, llama.cpp (the ik_llama.cpp fork is also supported via the llama.cpp backend), LiteLLM, Lemonade and Docker Model Runner, as well as a generic openai-compatible catch-all for any OpenAI API-compatible endpoint. It provides a unified interface with format translation across all backends. Learn more about Olla Integrations.

What are the main features of Olla?

Olla provides a unified model registry, intelligent load balancing with priority-based routing, automatic failover and recovery, health monitoring with circuit breakers, built-in rate limiting and security features, and comprehensive observability with structured logging and metrics. Olla also offers an extensive list of other features, which you can explore in the Olla Concepts.

Does Olla work with local LLM deployments?

Yes, Olla is specifically designed to work with local LLM deployments. It can route requests across local inference nodes running Ollama, LM Studio, or other local backends, providing load balancing and failover capabilities for self-hosted AI infrastructure.

OPEN SOURCE

Join the Olla Community

Olla is open source and growing fast. Perfect for small businesses, development teams, and organisations building reliable, unified AI infrastructure at any scale.

Active Development

100+ Stars
16K+ Container Pulls

Growing community with regular updates and improvements.

Star on GitHub

Contribute

We welcome contributions of all sizes. Every improvement matters!

  • πŸ› Report bugs and issues
  • ✨ Suggest new features
  • πŸ“ Improve documentation
  • πŸ”§ Submit pull requests
Contributing Guide

Perfect For

  • Small Businesses Reliable AI infrastructure for growing teams
  • Development Teams Unified API for multiple AI backends
  • Startups Build without cloud dependencies
  • Homelabs & Researchers Test and compare different models

Ready to optimise your LLM infrastructure?

Get started with Olla today and experience intelligent load balancing for your AI workloads.