The open-source AI inference proxy built for small businesses, development teams and growing organisations.
Unified interface for Ollama, LM Studio, vLLM, SGLang and llama.cpp with intelligent routing,
automatic failover and production-grade reliability. Deploy in minutes with Docker.
Built as a dedicated LLM proxy, Olla provides features focused on performance
and efficiency, managing your AI infrastructure from a single binary.
Unified Model Registry
Discover and access models across 9+ inference backends through a single unified API. Olla automatically aggregates models from Ollama, LM Studio, vLLM, SGLang and other OpenAI-compatible endpoints with seamless format translation - query any backend using your preferred API format.
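To illustrate what format translation involves, here is a simplified sketch (illustrative Python, not Olla's actual implementation) of mapping an OpenAI-style chat completion request onto Ollama's native /api/chat payload shape:

```python
def openai_to_ollama(request: dict) -> dict:
    """Map an OpenAI-style chat request to Ollama's /api/chat payload.
    Simplified illustration only; a real translation layer handles
    many more fields, streaming formats and edge cases."""
    return {
        "model": request["model"],
        "messages": request["messages"],  # both APIs use role/content messages
        "stream": request.get("stream", False),
        "options": {
            # OpenAI's top-level sampling params live under "options" in Ollama
            k: v for k, v in {
                "temperature": request.get("temperature"),
                "top_p": request.get("top_p"),
                "num_predict": request.get("max_tokens"),
            }.items() if v is not None
        },
    }

translated = openai_to_ollama({
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
})
```

A proxy doing this in both directions is what lets clients speak one API while backends speak another.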
Intelligent Load Balancing
Distribute requests efficiently across multiple backends with priority-based, round-robin, least-connections and weighted strategies. Choose the algorithm best suited to your workload - priority-based for tiered infrastructure, round-robin for uniform distribution, or least-connections for latency-sensitive applications. Automatic failover and recovery ensure high availability and minimal downtime.
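As a rough sketch of how these strategies differ (illustrative Python, not Olla's Go implementation; the endpoint fields are assumptions for the example):

```python
from itertools import count

# Hypothetical endpoint state a balancer might track.
endpoints = [
    {"name": "gpu-1", "priority": 1, "active": 3, "healthy": True},
    {"name": "gpu-2", "priority": 1, "active": 1, "healthy": True},
    {"name": "cpu-1", "priority": 2, "active": 0, "healthy": True},
]

_rr = count()  # shared round-robin cursor

def pick(strategy: str) -> dict:
    # Failover in miniature: unhealthy endpoints are never candidates.
    healthy = [e for e in endpoints if e["healthy"]]
    if strategy == "priority":
        # Prefer the best (lowest-numbered) tier; break ties by load.
        top = min(e["priority"] for e in healthy)
        tier = [e for e in healthy if e["priority"] == top]
        return min(tier, key=lambda e: e["active"])
    if strategy == "round-robin":
        return healthy[next(_rr) % len(healthy)]
    if strategy == "least-connections":
        return min(healthy, key=lambda e: e["active"])
    raise ValueError(f"unknown strategy: {strategy}")
```

With the state above, priority picks gpu-2 (best tier, least loaded), least-connections picks cpu-1, and round-robin simply cycles through all healthy endpoints.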
High-Performance Architecture
Built on a hexagonal architecture with clean separation between core logic and adapters. Features dual proxy engines - Sherpa for lightweight traffic handling and Olla for full-featured proxying. Per-endpoint connection pooling, aggressive object pooling to minimise GC pressure and lock-free statistics deliver sub-millisecond overhead even under heavy load.
Engineered for speed with minimal resource usage. Olla delivers enterprise-grade performance
from a single binary that runs anywhere.
Structured Logs
Real-time visibility into Olla's operations with comprehensive yet compact structured logging.
Monitor startup, endpoint health checks and request processing with millisecond precision.
Olla is perfect for small businesses, development teams, startups and growing organisations that need reliable, production-grade AI routing without enterprise complexity. It is ideal for teams getting started with AI infrastructure or running small to medium-scale deployments. For enterprise-scale operations, consider FoundryOS, launching in Q2 2026.
What is Olla?
Olla is a high-performance, open-source LLM proxy and load balancer that intelligently routes requests across multiple inference backends including Ollama, LM Studio, vLLM and SGLang. It provides unified model discovery, automatic failover and comprehensive observability for AI infrastructure.
How do I install Olla?
Olla can be installed using Docker or Podman with a single command: docker run -t --name olla -p 40114:40114 ghcr.io/thushan/olla:latest. This pulls the latest image and starts the proxy on port 40114 (4 OLLA). See our Quickstart Guide for details.
Is Olla free and open source?
Yes, Olla is completely free and open source software under the Apache License, Version 2.0. The source code is available on GitHub.
What license is Olla distributed as?
Olla is distributed under the Apache License, Version 2.0. Learn about Olla Licensing.
What inference backends does Olla support?
Olla supports 9+ inference backends including Ollama, LM Studio, vLLM, vLLM-MLX, SGLang, llama.cpp (the ik_llama.cpp fork is also supported via the llamacpp backend), LiteLLM, Lemonade and Docker Model Runner, as well as a generic openai-compatible catch-all for any OpenAI API-compatible endpoint. It provides a unified interface with format translation across all backends. Learn more about Olla Integrations.
What are the main features of Olla?
Olla provides a unified model registry, intelligent load balancing with priority-based routing, automatic failover and recovery, health monitoring with circuit breakers, built-in rate limiting and security features, and comprehensive observability with structured logging and metrics. Olla also has an extensive list of other features, which you can learn more about in the Olla Concepts.
Does Olla work with local LLM deployments?
Yes, Olla is specifically designed to work with local LLM deployments. It can route requests across local inference nodes running Ollama, LM Studio, or other local backends, providing load balancing and failover capabilities for self-hosted AI infrastructure.
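As a sketch of what such a setup might look like, the fragment below declares two local Ollama nodes behind Olla with different priorities. The field names and structure here are illustrative assumptions, not the confirmed schema - consult the Olla documentation for the actual configuration format.

```yaml
# Illustrative only: key names are assumptions, check the Olla docs.
discovery:
  static:
    endpoints:
      - name: workstation-gpu
        url: http://192.168.1.10:11434
        type: ollama
        priority: 1        # preferred node
      - name: spare-laptop
        url: http://192.168.1.20:11434
        type: ollama
        priority: 2        # failover target
```

With a layout like this, requests favour the GPU workstation and fall back to the second node when health checks fail.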
OPEN SOURCE
Join the Olla Community
Olla is open source and growing fast. Perfect for small businesses, development teams,
and organisations building reliable, unified AI infrastructure at any scale.
Active Development
100+ Stars
16K+ Container Pulls
Growing community with regular updates and improvements.