Stable • Open Source • Apache 2.0

Olla

The open-source AI inference proxy built for small businesses, development teams and growing organisations. Unified interface for Ollama, LM Studio, vLLM, SGLang and llama.cpp with intelligent routing, automatic failover and production-grade reliability. Deploy in minutes with Docker.

3.4K+
Downloads
Growing community adoption
100+
GitHub Stars
Active open source project
Apache 2.0
Open Source
Free forever, enterprise-friendly

Quick Start

Get up and running with Olla in minutes using containers.

$ docker run -t --name olla -p 40114:40114 ghcr.io/thushan/olla:latest
Unable to find image 'ghcr.io/thushan/olla:latest' locally
latest: Pulling from thushan/olla
a280336c5efa: Pull complete
c439f8aafbda: Download complete
4f4fb700ef54: Pull complete
28ca2a5cf8b5: Download complete
8066e87b2613: Pull complete
Digest: sha256:e03b64577b3794d3b31d76640dc93d2bfaf184796915fcbbca37dc4fcff140f7
Status: Downloaded newer image for ghcr.io/thushan/olla:latest
$ curl http://localhost:40114/olla/models
{"object":"list","data":[{"olla":{"family":"","variant":"","parameter_size":"","quantization":"","aliases":["ai/gpt-oss"],"availability":[{"endpoint":"local-docker","state":"unknown"}],"capabilities":["text-generation"]},"id":"ai/gpt-oss","object":"model","owned_by":"olla","created":1762542266}]}

Product Overview

Built as a dedicated LLM proxy, Olla combines intelligent routing, model discovery, failover and observability in a single binary, making your AI infrastructure fast and efficient to manage.

Unified Model Registry

Discover and access models across all your inference backends through a single unified API. Olla automatically aggregates models from Ollama, LM Studio, vLLM and other OpenAI-compatible endpoints.
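
For example, using the /olla/models response shown in the Quick Start, you can list every aggregated model and the endpoints it is available on with jq:

$ curl -s http://localhost:40114/olla/models | jq -r '.data[] | "\(.id)  [\(.olla.availability[].endpoint)]"'
ai/gpt-oss  [local-docker]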


Intelligent Load Balancing

Distribute requests efficiently across multiple backends with priority-based, round-robin and least-connections strategies. Automatic failover & recovery ensures high availability and minimal downtime.
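
As a rough sketch of how this could be configured, each backend is given a priority and Olla prefers the highest-priority healthy endpoint, falling back to the others if it goes down. The key names and values below are illustrative assumptions, not the exact schema; see the configuration reference for the real layout:

# Illustrative sketch only: key names, endpoint names and addresses are assumptions.
proxy:
  load_balancer: priority          # assumed options: priority, round-robin, least-connections
discovery:
  static:
    endpoints:
      - name: workstation-ollama   # hypothetical local Ollama node
        url: http://192.168.1.10:11434
        type: ollama
        priority: 100              # preferred backend
      - name: spare-lmstudio       # hypothetical LM Studio node
        url: http://192.168.1.20:1234
        type: lm-studio
        priority: 50               # used on failover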


High Performance

Built for speed with connection pooling, object pooling and lock-free statistics. Olla delivers low-latency responses even under heavy load.


Health Monitoring

Circuit breakers and health checks automatically detect and route around unhealthy backends. Real-time metrics provide visibility into your infrastructure.


Security & Rate Limiting

Built-in rate limiting and request validation to protect your infrastructure. Comprehensive audit logging tracks all requests.


Observability

Comprehensive metrics and structured logging provide deep insights into your LLM infrastructure performance and behaviour.


Blazing Fast Performance

Engineered for speed with minimal resource usage. Olla delivers enterprise-grade performance from a single binary that runs anywhere.

Structured Logs

Real-time visibility into Olla's operations with compact, comprehensive structured logging. Monitor startup, endpoint health checks and request processing with millisecond precision.

olla
╔─────────────────────────────────────────────────────────╗
β”‚                                      β €β €β£€β£€β €β €β €β €β €β£€β£€β €β €    β”‚
β”‚                                      ⠀⒰⑏Ⓓ⑆⠀⠀⠀⒰⑏Ⓓ⑆⑀   β”‚
β”‚   β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•—     β–ˆβ–ˆβ•—      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—    β €β’Έβ‘‡β£Έβ‘·β Ÿβ ›β »β’Ύβ£‡β£Έβ‘‡    β”‚
β”‚  β–ˆβ–ˆβ•”β•β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•—   β’ β‘Ύβ ›β ‰β β €β €β €β ˆβ ‰β ›β’·β‘„   β”‚
β”‚  β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•‘   ⣿⠀⒀⣄⒀⣠⣀⣄⑀⣠⑀⠀⣿   β”‚
β”‚  β–ˆβ–ˆβ•‘   β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•‘     β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•‘   β’»β£„β ˜β ‹β‘žβ ‰β’€β ‰β’³β ™β ƒβ’ β‘Ώβ‘€ β”‚
β”‚  β•šβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘  β–ˆβ–ˆβ•‘   β£Όβ ƒβ €β €β ³β €β ¬β €β žβ €β €β ˜β£·   β”‚
β”‚                                      β’Έβ‘Ÿβ €β €β €β €β €β €β €β €β €β’Έβ‘‡    β”‚
β”‚  github.com/thushan/olla   v0.0.21   β’Έβ‘…β €β €β €β €β €β €β €β €β €β’€β‘Ώ    β”‚
β•šβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β•

Supported Backends

Olla provides integrations for a variety of backends.

See Integrations
Ollama, LM Studio, vLLM, LiteLLM, OpenAI-like, Lemonade, SGLang, llama.cpp, ik_llama.cpp, vllm-gaudi

Frequently Asked Questions

Who is Olla designed for?

Olla is perfect for small businesses, development teams, startups and growing organisations that need reliable, production-grade AI routing without enterprise complexity. Ideal for teams getting started with AI infrastructure or running small to medium-scale deployments. For enterprise-scale operations, consider FoundryOS launching in Q2 2026.

What is Olla?

Olla is a high-performance, open-source LLM proxy and load balancer that intelligently routes requests across multiple inference backends including Ollama, LM Studio, vLLM and SGLang. It provides unified model discovery, automatic failover and comprehensive observability for AI infrastructure.

How do I install Olla?

Olla can be installed using Docker or Podman with a single command: docker run -t --name olla -p 40114:40114 ghcr.io/thushan/olla:latest. This pulls the latest image and starts the proxy on port 40114 (4 OLLA). See our Quickstart Guide for details.
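
The same command works with Podman:

$ podman run -t --name olla -p 40114:40114 ghcr.io/thushan/olla:latest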

Is Olla free and open source?

Yes, Olla is completely free and open source software under the Apache License, Version 2.0. The source code is available on GitHub.

What license is Olla distributed as?

Olla is distributed under the Apache License, Version 2.0. Learn about Olla Licensing.

What inference backends does Olla support?

Olla natively supports multiple inference backends including Ollama, LM Studio, vLLM and SGLang, as well as any OpenAI API-compatible endpoint. It provides a unified interface across all of these backends. Learn more about Olla Integrations.

What are the main features of Olla?

Olla provides a unified model registry, intelligent load balancing with priority-based routing, automatic failover and recovery, health monitoring with circuit breakers, built-in rate limiting and security features, and comprehensive observability with structured logging and metrics. Olla also has an extensive list of other features you can learn more about in the Olla Concepts documentation.

Does Olla work with local LLM deployments?

Yes, Olla is specifically designed to work with local LLM deployments. It can route requests across local inference nodes running Ollama, LM Studio, or other local backends, providing load balancing and failover capabilities for self-hosted AI infrastructure.

OPEN SOURCE

Join the Olla Community

Olla is open source and growing fast. Perfect for small businesses, development teams, and organisations building reliable, unified AI infrastructure at any scale.

Active Development

100+ Stars
3.4K+ Downloads

Growing community with regular updates and improvements.

Star on GitHub

Contribute

We welcome contributions of all sizes. Every improvement matters!

  • πŸ› Report bugs and issues
  • ✨ Suggest new features
  • πŸ“ Improve documentation
  • πŸ”§ Submit pull requests
Contributing Guide

Perfect For

  • Small Businesses: Reliable AI infrastructure for growing teams
  • Development Teams: Unified API for multiple AI backends
  • Startups: Build without cloud dependencies
  • Homelabs & Researchers: Test and compare different models

Ready to optimise your LLM infrastructure?

Get started with Olla today and experience intelligent load balancing for your AI workloads.