Unified Model Registry
Discover and access models across all your inference backends through a single unified API. Olla automatically aggregates models from Ollama, LM Studio, vLLM and other OpenAI-compatible endpoints.
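As a rough sketch of what this looks like from a client's point of view, the Go snippet below fetches the aggregated model list over HTTP. The default port 40114 comes from the quick-start command on this page; the /olla/models path is an assumption, so check the routes your Olla version actually exposes.

```go
// Minimal sketch: list the models Olla has aggregated across its backends.
// Assumptions: Olla is listening on localhost:40114 (the default port used
// on this page) and serves the unified model list at /olla/models — verify
// the exact route against your Olla version's documentation.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:40114/olla/models")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// One response, regardless of whether the models live on Ollama,
	// LM Studio, vLLM or another OpenAI-compatible backend.
	fmt.Println(string(body))
}
```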
The open-source AI inference proxy built for small businesses, development teams and growing organisations. Unified interface for Ollama, LM Studio, vLLM, SGLang and llama.cpp with intelligent routing, automatic failover and production-grade reliability. Deploy in minutes with Docker.
Get up and running with Olla in minutes using containers.
Built as a dedicated LLM proxy, Olla delivers features designed for performance and efficiency, letting you manage your AI infrastructure from a single binary.
Model Discovery
Discover and access models across all your inference backends through a single unified API. Olla automatically aggregates models from Ollama, LM Studio, vLLM and other OpenAI-compatible endpoints.

Load Balancing
Distribute requests efficiently across multiple backends with priority-based, round-robin and least-connections strategies. Automatic failover & recovery ensures high availability and minimal downtime.
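For illustration only (this is not Olla's internal code), a minimal least-connections picker over a pool of backends might look like this in Go:

```go
// Illustrative sketch of the least-connections idea: pick the healthy
// backend currently serving the fewest in-flight requests.
package main

import "fmt"

type Backend struct {
	Name     string
	Healthy  bool
	InFlight int // requests currently being served
}

// leastConnections returns the healthy backend with the fewest in-flight
// requests, or nil if no backend is healthy.
func leastConnections(backends []*Backend) *Backend {
	var best *Backend
	for _, b := range backends {
		if !b.Healthy {
			continue
		}
		if best == nil || b.InFlight < best.InFlight {
			best = b
		}
	}
	return best
}

func main() {
	pool := []*Backend{
		{Name: "ollama-1", Healthy: true, InFlight: 3},
		{Name: "ollama-2", Healthy: true, InFlight: 1},
		{Name: "lmstudio-1", Healthy: false, InFlight: 0},
	}
	// ollama-2 wins: healthy and serving the fewest requests.
	fmt.Println("route to:", leastConnections(pool).Name)
}
```

Priority-based and round-robin strategies follow the same shape: the picker just ranks by configured priority or rotates through the healthy pool instead.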
Performance
Built for speed with connection pooling, object pooling and lock-free statistics. Olla delivers low-latency responses even under heavy load.
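As a conceptual sketch rather than Olla's implementation, lock-free statistics typically boil down to atomic counters that many goroutines can update without taking a mutex:

```go
// Toy illustration of lock-free statistics using sync/atomic counters.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type Stats struct {
	requests atomic.Int64
	errors   atomic.Int64
}

func main() {
	var stats Stats
	var wg sync.WaitGroup

	// Many concurrent requests can record themselves without contention.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			stats.requests.Add(1) // no lock needed
		}()
	}
	wg.Wait()

	fmt.Println("requests:", stats.requests.Load(), "errors:", stats.errors.Load())
}
```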
Health Monitoring
Circuit breakers and health checks automatically detect and route around unhealthy backends. Real-time metrics provide visibility into your infrastructure.
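A circuit breaker, in its simplest form, trips after a run of failures and skips the backend until a cool-down expires. The sketch below is a toy illustration of that idea, not Olla's actual implementation:

```go
// Conceptual circuit breaker: after a run of consecutive failures the
// backend is "tripped" and skipped until a cool-down period has passed.
package main

import (
	"fmt"
	"time"
)

type CircuitBreaker struct {
	failures  int
	threshold int
	openUntil time.Time
	cooldown  time.Duration
}

// Allow reports whether requests may currently be sent to the backend.
func (cb *CircuitBreaker) Allow() bool {
	return time.Now().After(cb.openUntil)
}

// Record updates the breaker with the outcome of a request.
func (cb *CircuitBreaker) Record(success bool) {
	if success {
		cb.failures = 0
		return
	}
	cb.failures++
	if cb.failures >= cb.threshold {
		cb.openUntil = time.Now().Add(cb.cooldown) // trip: stop routing here
		cb.failures = 0
	}
}

func main() {
	cb := &CircuitBreaker{threshold: 3, cooldown: 30 * time.Second}
	for i := 0; i < 3; i++ {
		cb.Record(false) // three consecutive failures trip the breaker
	}
	fmt.Println("requests allowed:", cb.Allow()) // requests allowed: false
}
```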
Security
Built-in rate limiting and request validation to protect your infrastructure. Comprehensive audit logging tracks all requests.

Observability
Comprehensive metrics and structured logging provide deep insights into your LLM infrastructure performance and behaviour.

Engineered for speed with minimal resource usage. Olla delivers enterprise-grade performance from a single binary that runs anywhere.
Real-time visibility into Olla's operations with comprehensive yet compact structured logging. Monitor startup, endpoint health checks and request processing with millisecond precision.
[Olla startup banner (ASCII art logo) — github.com/thushan/olla v0.0.21]
Olla is perfect for small businesses, development teams, startups and growing organisations that need reliable, production-grade AI routing without enterprise complexity. It is ideal for teams getting started with AI infrastructure or running small to medium-scale deployments. For enterprise-scale operations, consider FoundryOS, launching in Q2 2026.
Olla is a high-performance, open-source LLM proxy and load balancer that intelligently routes requests across multiple inference backends including Ollama, LM Studio, vLLM and SGLang. It provides unified model discovery, automatic failover and comprehensive observability for AI infrastructure.
Olla can be installed using Docker or Podman with a single command: docker run -t --name olla -p 40114:40114 ghcr.io/thushan/olla:latest. This pulls the latest image and starts the proxy on port 40114 ("4 OLLA"); see our Quickstart Guide.
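If you want to script a post-install check, something like the Go snippet below can poll the proxy until it answers. The /internal/health path is an assumption here — substitute whichever health endpoint your Olla version exposes (see the Quickstart Guide).

```go
// Quick sanity check after starting the container: poll the proxy until it
// responds. The /internal/health path is an assumption — replace it with the
// health endpoint your Olla version actually exposes.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	url := "http://localhost:40114/internal/health"
	for i := 0; i < 10; i++ {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			fmt.Println("olla is up, status:", resp.StatusCode)
			return
		}
		time.Sleep(time.Second)
	}
	fmt.Println("olla did not respond on", url)
}
```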
Yes, Olla is completely free and open source software under the Apache License, Version 2.0. The source code is available on GitHub.
Olla is distributed under the Apache License, Version 2.0. Learn about Olla Licensing.
Olla natively supports multiple inference backends, including Ollama, LM Studio, vLLM and SGLang, as well as any OpenAI API-compatible endpoint. It provides a unified interface across all of these backends. Learn more about Olla Integrations.
Olla provides a unified model registry, intelligent load balancing with priority-based routing, automatic failover and recovery, health monitoring with circuit breakers, built-in rate limiting and security features, and comprehensive observability with structured logging and metrics. Olla also has an extensive list of other features; you can learn more in the Olla Concepts.
Yes, Olla is specifically designed to work with local LLM deployments. It can route requests across local inference nodes running Ollama, LM Studio, or other local backends, providing load balancing and failover capabilities for self-hosted AI infrastructure.
Olla is open source and growing fast. Perfect for small businesses, development teams, and organisations building reliable, unified AI infrastructure at any scale.
Growing community with regular updates and improvements.
Star on GitHub
We welcome contributions of all sizes. Every improvement matters!
Get started with Olla today and experience intelligent load balancing for your AI workloads.