Run MLX Behind Olla on Your Mac

There are at least eight ways to serve an MLX model with an OpenAI API now, and that is before you count the Ollama and llama.cpp boxes you already have. The Mac local-AI stack has quietly fragmented. This guide shows how to stop caring which server you picked: put Olla in front of oMLX and the rest, and your applications talk to one endpoint while the backends underneath stay free to change.

If you have not met Olla yet, it is our open-source LLM proxy: a single Go binary you point at all your inference backends so your clients see one address. The Olla product page covers the whole picture, and our "What is an LLM proxy?" article explains the pattern from scratch. This guide is specifically about the Apple Silicon side: MLX, oMLX, and mixing them with everything else.

For the background on MLX itself - why it is fast, the Neural Engine myth, how it compares to llama.cpp - start with our MLX on Apple Silicon guide.

The problem: your Mac AI stack is fragmented

A realistic setup these days looks something like this. You have a Mac Studio running oMLX for the MLX-native models because it is the fastest path on Apple Silicon. You have an old box with Ollama on it for the GGUF models that have day-one availability. Maybe there is an LM Studio install on the laptop too. Each speaks a slightly different dialect, lists models under different names, and lives at a different address.

So every application you write now has to know about all of them. When one is down, the app breaks. When you add a backend, you edit code. And when you want to run Claude Code or another Anthropic-shaped tool against a local model, you discover that not every backend speaks the Anthropic API.

This is exactly the job a proxy does. One endpoint in front of everything, health checking and failover underneath, model-aware routing so the right request lands on the right backend. Your code stops caring what is running where.

What Olla does, briefly

Olla sits in front of your backends and gives you a single OpenAI-compatible (and Anthropic-compatible) endpoint on port 40114. It discovers what models each backend has and unifies them into one catalogue, routes by priority and load, health-checks every endpoint with circuit breakers, and pins multi-turn conversations to the same backend so the KV cache stays warm. It does all this in well under 50MB of RAM, as a single binary with one YAML config.

[Figure: Olla routing to multiple backends. An app (OpenAI / Anthropic client) connects to a single Olla endpoint on port 40114, which routes requests to oMLX, Ollama and LM Studio backends.]
Your app talks to one Olla endpoint. Olla routes each request to whichever backend has the model, so the backends behind it can change without your app noticing.

The relevant part for this guide: Olla treats oMLX as a first-class backend, with native support since v0.0.28.

Adding oMLX, native since v0.0.28

oMLX is a good match for a proxy because it is multi-model. Unlike a single-model server, one oMLX instance serves many models from one process, loading them on demand and evicting the least-recently-used when memory runs tight. It also keeps a tiered KV cache - hot blocks in RAM, cold blocks persisted to SSD - so a long context survives a restart instead of being recomputed. That makes it practical behind a coding agent.

Because oMLX is multi-model, you want one Olla endpoint per oMLX instance, not one per model. Here is the whole configuration:

discovery:
  static:
    endpoints:
      - url: "http://localhost:8000"
        name: "local-omlx"
        type: "omlx"
        priority: 75
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s
        check_timeout: 2s

That is it. The type: "omlx" tells Olla to use its oMLX profile; the rest is the standard endpoint shape. Olla queries /v1/models, discovers everything oMLX has loaded or can load, and exposes it all through the /olla/omlx prefix:

curl http://localhost:40114/olla/omlx/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 500
  }'
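
The same request can be issued from application code. Below is a minimal Python sketch using only the standard library. It assumes Olla is listening on localhost:40114 as in the curl example; the helper names build_chat_request and send_chat are hypothetical, and the model name should be whatever your oMLX instance actually reports.

```python
import json
import urllib.request

# Assumption: Olla is reachable on this machine at the port used in this guide.
OLLA_BASE = "http://localhost:40114/olla/omlx"


def build_chat_request(model: str, prompt: str, max_tokens: int = 500) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{OLLA_BASE}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's text.

    Needs a running Olla with an oMLX backend behind it.
    """
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With Olla running, send_chat(build_chat_request("Qwen2.5-7B-Instruct-4bit", "Hello")) would return the model's reply. Keeping request construction separate from sending makes the payload easy to inspect without a live backend.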

Mixing MLX and GGUF

This is what a proxy is really for. Say you want oMLX for the MLX-native models on your Mac Studio, and Ollama for GGUF models you can pull the moment they are released. Add both:

discovery:
  static:
    endpoints:
      - url: "http://mac-studio:8000"
        name: "studio-omlx"
        type: "omlx"
        priority: 90
        model_url: "/v1/models"
        health_check_url: "/health"
        check_interval: 5s

      - url: "http://nuc:11434"
        name: "nuc-ollama"
        type: "ollama"
        priority: 80
        model_url: "/api/tags"
        health_check_url: "/"
        check_interval: 5s

proxy:
  engine: "olla"
  load_balancer: "priority"

model_registry:
  enable_unifier: true
  unification:
    enabled: true

The catch with mixing backends is naming. The same model family is gpt-oss:120b on Ollama, something like gpt-oss-120b-MLX under MLX, and a GGUF filename on llama.cpp. Three backends, three names for one model. Olla's model unification gives you a single /v1/models catalogue across all of them, and aliases let you collapse the variants under one virtual name:

model_aliases:
  gpt-oss-120b:
    - gpt-oss-120b-MLX          # oMLX / MLX format
    - gpt-oss:120b              # Ollama format
    - gguf_gpt_oss_120b.gguf    # llama.cpp GGUF file

Now your client asks for gpt-oss-120b and Olla works out which backend can serve it, rewrites the name to whatever that backend expects, and routes accordingly. One name, three backends, no client changes. Set the MLX box to a higher priority and it wins when the model is available on both, falling through to Ollama otherwise.
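
To make the alias behaviour concrete, here is a small Python model of the idea. It is an illustration only, not Olla's implementation: the BACKENDS table is a hypothetical stand-in for what discovery would report, and resolve is a made-up helper name.

```python
# Illustrative model of alias resolution; not Olla's actual code.
MODEL_ALIASES = {
    "gpt-oss-120b": [
        "gpt-oss-120b-MLX",        # oMLX / MLX format
        "gpt-oss:120b",            # Ollama format
        "gguf_gpt_oss_120b.gguf",  # llama.cpp GGUF file
    ],
}

# Hypothetical discovery results: (endpoint name, priority, models it can serve).
BACKENDS = [
    ("studio-omlx", 90, {"gpt-oss-120b-MLX", "Qwen2.5-7B-Instruct-4bit"}),
    ("nuc-ollama", 80, {"gpt-oss:120b"}),
]


def resolve(requested, backends=BACKENDS, aliases=MODEL_ALIASES):
    """Return (endpoint, backend-specific model name) or None if nothing matches.

    Backends are tried from highest to lowest priority; the requested name is
    expanded to its alias list, or used as-is when no alias is defined.
    """
    candidates = aliases.get(requested, [requested])
    for name, _priority, models in sorted(backends, key=lambda b: b[1], reverse=True):
        for candidate in candidates:
            if candidate in models:
                return name, candidate
    return None
```

In this model, resolve("gpt-oss-120b") picks the MLX variant on the higher-priority endpoint, and dropping that endpoint from the list makes the same call fall through to the Ollama name.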

OpenAI and Anthropic, both first-class

Olla speaks both protocols, and that matters more than it sounds. Most local tooling is OpenAI-shaped, but a growing share of the good agentic tools - Claude Code, OpenCode, Crush - are Anthropic-shaped. Olla gives you one endpoint for each: /olla/openai (or just /olla/proxy) for OpenAI clients, /olla/anthropic for Anthropic ones, both routing to the same backends underneath.

The Anthropic side is something we have put real work into recently, because getting it right is fiddly. When a backend implements the Anthropic Messages API natively - oMLX does - Olla detects that and forwards straight through in passthrough mode, skipping the Anthropic-to-OpenAI-to-Anthropic round trip entirely (you will see an X-Olla-Mode: passthrough header on the response). When a backend only speaks OpenAI, Olla does the full translation in both directions, including streaming, so the same /olla/anthropic endpoint works regardless of what is behind it.

export ANTHROPIC_BASE_URL="http://localhost:40114/olla/anthropic"
claude

That is Claude Code running entirely against a local MLX model on your own Mac. The cache-friendly KV tiering of oMLX underneath keeps multi-turn, tool-heavy sessions cheap to resume, and oMLX's Claude Code context-scaling pairs nicely here too - it rescales reported token counts so auto-compaction fires at the right time on smaller-context models. Run that way, the agent behaves like the real thing rather than a rough local approximation.
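
If you want to confirm from code which path a response took, you can look for the X-Olla-Mode header mentioned above. The Python sketch below assumes the Anthropic Messages route is the base URL plus /v1/messages (which is what Anthropic clients append to ANTHROPIC_BASE_URL) and sends the standard anthropic-version header; whether Olla requires that header is an assumption, and the helper names are hypothetical.

```python
import json
import urllib.request

# Same base URL as the ANTHROPIC_BASE_URL export in this guide.
ANTHROPIC_BASE = "http://localhost:40114/olla/anthropic"


def build_messages_request(model: str, prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) an Anthropic Messages API request."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{ANTHROPIC_BASE}/v1/messages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Standard Anthropic API version header; assumed harmless for a local proxy.
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


def is_passthrough(headers) -> bool:
    """True when the response headers report passthrough mode."""
    return (headers.get("X-Olla-Mode") or "").lower() == "passthrough"
```

After sending the request with urllib.request.urlopen, pass resp.headers to is_passthrough; a False result suggests the backend was reached through translation rather than natively.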

Scaling across Macs

Because each oMLX instance is multi-model, scaling out is just more endpoints. Put a Mac Studio and a Mac mini behind Olla, give the faster one a higher priority, and let the olla engine's priority load balancer prefer it:

A two-Mac setup

  • Mac Studio (priority 90): Your fastest hardware with the most unified memory. Keeps the heavy models resident and takes traffic first.
  • Mac mini (priority 80): Picks up overflow and serves as failover. If the Studio is busy or down, requests land here automatically.

If a Mac goes offline, Olla's circuit breaker stops sending it traffic and keeps probing in the background; when it returns, Olla re-discovers its models in case anything changed and traffic flows again. One machine dropping out does not take your endpoint with it.
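
The interaction between priority and health can be sketched in a few lines of Python. This is a simplified model, not Olla's implementation: the failure threshold of 3 and the rule that one successful probe restores an endpoint are assumptions made for illustration, as are the Endpoint and pick names.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int
    failures: int = 0   # consecutive failed health checks
    threshold: int = 3  # assumed value; the real breaker settings may differ

    @property
    def healthy(self) -> bool:
        # The breaker is "open" (endpoint skipped) once failures reach the threshold.
        return self.failures < self.threshold

    def record_check(self, ok: bool) -> None:
        # In this simplified model a single successful probe closes the breaker.
        self.failures = 0 if ok else self.failures + 1


def pick(endpoints):
    """Return the highest-priority healthy endpoint, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    return max(healthy, key=lambda e: e.priority, default=None)
```

With a Studio at priority 90 and a mini at 80, pick returns the Studio until it fails three checks in a row, then the mini, and the Studio again as soon as a probe succeeds.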

What the proxy costs you

Any proxy adds something to the hot path, so the question is how much. Olla's endpoint selection uses lock-free atomic stats and runs in well under a millisecond, and its streaming-first design pushes tokens through with minimal buffering. Against the seconds a model spends generating, the routing overhead is rounding error. Our Olla performance benchmarks have the numbers if you want to see them measured rather than asserted.

What you get for that sub-millisecond cost: one endpoint, health checks, failover, model unification across MLX and GGUF naming, and Anthropic passthrough. For a stack that would otherwise have your applications juggling four addresses and three naming schemes, that is a good trade.

One endpoint, whatever you run behind it

The Mac inference landscape is going to keep moving - new MLX servers, new model formats, new chips. The value of a thin proxy is that none of that reaches your applications. Point Olla at whatever you are running today, and when you swap oMLX for the next fast thing, or add a GPU box, your clients never notice. For larger GPU and data-centre deployments, FoundryOS is the heavier counterpart Olla works alongside.


Key takeaways

  • The Mac AI stack has fragmented - several MLX servers, plus your existing Ollama and llama.cpp. A proxy gives you one endpoint over all of it.
  • Olla has supported oMLX natively since v0.0.28. Add one endpoint with type: "omlx"; because oMLX is multi-model, you need one endpoint per instance, not per model.
  • Model unification and aliases let one virtual name map onto the differently-named MLX and GGUF variants, so clients never see the difference.
  • Anthropic passthrough means Claude Code runs against a local MLX model with no translation round trip.
  • The proxy overhead is sub-millisecond, dwarfed by inference time, in exchange for failover, health checks and a stable endpoint.

Further reading