EAP Q3 2026
FoundryOS

Scalable. Secure. Private.

Deploy, unify and scale your local-AI infrastructure - powered by vLLM, SGLang or LlamaCpp - with cloud-level reliability and the control, monitoring and privacy of self-hosting. Run it on private cloud, air-gapped systems or on-premises hardware.

CursorOpenAI APIClaude CodeAnthropic APIOpen WebUIOpenAI APIYour AppOpenAI APIFoundryOSAPI TranslationPass-throughLoad BalancingIntelligent routingHealth MonitoringAll systems operationalRequest RoutingModel-aware routingOpenAI APIClaude APIvLLM Instance 1phi-435%vLLM Instance 2phi-472%SGLangglm-4-645%llama.cppunsloth-qwen328%

Architecture Overview

FoundryOS is built on a distributed architecture with core components working in harmony to deliver enterprise-grade AI inference at scale. Built with Go 1.26 for maximum performance.

Fleet

The central orchestration layer that coordinates your entire AI infrastructure. Fleet tracks node and endpoint health, model registration and routing state, and distributes configuration to every node. State persists through a pluggable store, from a single JSON file to Redis.

Scout

Lightweight monitoring agents deployed on each inference node. Scout collects real-time metrics including GPU utilisation, memory usage and latency percentiles (p50/p90/p99) for comprehensive observability.

Relay

The LLM proxy that applications talk to. Relay accepts OpenAI or Anthropic API requests, resolves the target model and selects the healthiest backend with load-aware routing. Circuit breakers and exponential backoff keep requests flowing when backends come under pressure.

Deck

A browser-based operator console embedded in Fleet and served on the same port. Deck gives a live cluster view with event streaming, so operators can inspect nodes, endpoints and models, drain or re-enable backends and trigger an immediate health re-probe without touching the CLI.

Enterprise Ready Inference

FoundryOS is built for teams deploying AI infrastructure on internal clouds, private networks, or airgapped systems. With native vLLM, SGLang, llama.cpp and Aphrodite support and robust features to help you deploy, manage and scale your AI workloads efficiently. With low-overhead and high-performance at scale.

  • Container-based deployment (not SaaS)
  • Monitoring and management tools for realtime analysis
  • Air-gapped and on-premises / own-cloud support
  • Enterprise GPU optimisation
  • Built with Go 1.26

Unified API

FoundryOS unifies multiple inference backends and models under a single API with protocol-level translation. Applications and users can query using either OpenAI or Anthropic API formats - FoundryOS handles the translation at the protocol level, preserving streaming, tool use and all advanced features.

  • Protocol-level API translation (not just format conversion)
  • Query using OpenAI or Anthropic API formats interchangeably
  • Full streaming and tool use support across translations
  • Support Claude Code, Cursor with on-premises models easily
  • Automatic request/response transformation

Model Unification

FoundryOS unifies your AI models across multiple inference backends under a single API and management plane. Seamlessly switch between vLLM, SGLang, llama.cpp and Aphrodite, validate which backend works best for your workload.

  • Multi-backend model unification
  • Test & validate models across backends
  • Run-time backend switching with zero downtime
  • Provide model redundancy across multiple nodes

Native Inference Backends

FoundryOS integrates natively with inference backends to provide monitoring and management capabilities.

vLLM
SGLang
llama.cpp
Aphrodite
TensorRT-LLM coming soon

Health Monitoring

FoundryOS provides intelligent health checks with circuit breakers and exponential backoff to ensure your AI infrastructure stays online and available for your users, applications and customers.

  • Circuit breakers with configurable failure thresholds
  • Exponential backoff to prevent cascading failures
  • Half-open probing restores an endpoint once it passes health checks again
  • Unhealthy backends are removed from routing automatically until they recover
  • Real-time health dashboards in the Deck operator console

Inference Observability

Get complete visibility into your AI infrastructure with comprehensive observability. Scout agents collect detailed metrics from every node, providing real-time insights into performance and resource utilisation.

  • GPU utilisation and memory tracking per node
  • Latency percentiles: p50, p90 and p99 for SLA compliance
  • Tokens per second and throughput metrics
  • Prometheus /metrics endpoints on Fleet, Scout and Relay
  • Model performance analysis with usage attribution

Enterprise Control

FoundryOS keeps configuration consistent across your entire AI infrastructure. Fleet distributes versioned per-node overlays to every Scout and Relay over the existing gRPC stream, and persists state through a pluggable store that scales from a single JSON file to Redis.

  • Token authentication with role-based access control (RBAC) coming soon
  • Pluggable distributed state: JSON file, in-memory or Redis
  • Centralised config distribution with versioned per-node overlays
  • Overlays applied live, with hot-reload and no node restart
  • Per-node tokens issued at registration, stable across Fleet restarts

Frequently Asked Questions

What is FoundryOS?

FoundryOS is an enterprise-grade platform for deploying, unifying and scaling local AI infrastructure. It consists of three core components: Fleet, the central orchestration layer that manages cluster state and configuration distribution; Scout, lightweight monitoring agents deployed on each inference node; and Relay, the LLM proxy that accepts OpenAI or Anthropic API requests and routes them with load-aware logic. Deck, a browser-based operator console, is embedded in Fleet for live cluster management.

Who is FoundryOS designed for?

FoundryOS is designed for enterprise teams running AI inference at scale on private cloud, air-gapped systems or on-premises hardware. It suits organisations that need distributed GPU fleet management, need cloud-level reliability without cloud lock-in, or must keep data on their own infrastructure for compliance and security reasons.

When will FoundryOS be available?

FoundryOS is targeting its Early Access Program in Q3 2026. Join the waitlist to be notified when access opens and to influence the product roadmap directly.

What inference backends does FoundryOS support?

FoundryOS provides native support for vLLM, SGLang, llama.cpp and Aphrodite. TensorRT-LLM support is planned. Relay, the built-in LLM proxy, handles OpenAI and Anthropic API translation at the protocol level, preserving streaming and tool use across all supported backends.

How does FoundryOS differ from Olla?

Olla is an open-source, single-binary LLM proxy suited for small to medium deployments. FoundryOS is the enterprise platform for distributed, multi-node GPU fleets. Fleet orchestrates cluster state, Scout collects per-node telemetry including GPU utilisation and p50/p90/p99 latency metrics, and Relay handles protocol-level API translation. If you are managing a single server or a small team, Olla is the right starting point. When you need fleet-scale management, FoundryOS is the next step.

How do I get early access to FoundryOS?

Join the FoundryOS waitlist to register your interest for the Q3 2026 Early Access Program. Join the waitlist.

Join the FoundryOS Waitlist

Be among the first to deploy enterprise AI infrastructure when FoundryOS launches. EAP Q3 2026.