EAP Q3 2026
Forge

Pure Rust. Pure Speed.

CUDA-native LLM inference, powered by the Inferno engine and shipped as a single binary. 12+ model families, 9+ quantisation formats and 39 hand-tuned CUDA kernels, with native OpenAI and Anthropic API compatibility and extensive customisation.

What Forge Delivers

Built from the ground up in Rust for NVIDIA hardware. No abstraction tax. Every microsecond is accounted for.

12+ Model Families

Llama, Qwen, Mistral, Gemma 1 through 4, Phi, DeepSeek, GLM-4 and GPT-OSS. Dense, mixture-of-experts and sliding-window variants are all covered, with ongoing additions as new architectures emerge.

9+ Quantisation Formats

BF16, FP8, INT4/GPTQ, INT4/AWQ, NVFP4, MXFP4 and TurboQuant4, our own Lloyd-Max codebook format with per-group Hadamard rotation. Native FP8 GEMM via cublasLt on Hopper (H100/H200) and Blackwell (B200) hardware keeps throughput high.

Flash Attention & Paged KV

Flash-attention prefill, a paged KV cache supporting up to 128K context and sliding-window attention combine to deliver high throughput with minimal memory overhead. FP8 paged KV halves cache memory relative to a 16-bit cache, roughly doubling block capacity on the same GPU.
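The halving follows from element width: a BF16 value takes 2 bytes and an FP8 value takes 1. The sketch below works through that arithmetic. The model shape (32 layers, 8 KV heads, head dimension 128), the 16-token block size and the 16 GiB budget are illustrative assumptions, not Forge's actual configuration, and the function names are hypothetical.

```rust
/// Bytes of KV cache one token occupies: a K and a V vector
/// for every layer and every KV head.
fn kv_bytes_per_token(layers: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: u64) -> u64 {
    2 * layers * kv_heads * head_dim * bytes_per_elem
}

/// Number of whole paged-KV blocks that fit in a memory budget.
fn blocks_in_budget(budget_bytes: u64, block_tokens: u64, bytes_per_token: u64) -> u64 {
    budget_bytes / (block_tokens * bytes_per_token)
}

/// Block capacity for a hypothetical model under a 16 GiB cache budget,
/// returned as (bf16_blocks, fp8_blocks).
fn example_capacity() -> (u64, u64) {
    let budget = 16 * 1024 * 1024 * 1024; // 16 GiB, assumed
    let bf16 = kv_bytes_per_token(32, 8, 128, 2); // 2 bytes per BF16 element
    let fp8 = kv_bytes_per_token(32, 8, 128, 1); // 1 byte per FP8 element
    (blocks_in_budget(budget, 16, bf16), blocks_in_budget(budget, 16, fp8))
}
```

Under these assumptions the same budget holds 8,192 BF16 blocks or 16,384 FP8 blocks. In practice the gain is likely to be slightly under 2x if the FP8 cache stores scaling metadata alongside the values, which this sketch ignores.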

Speculative Decoding

EAGLE-3 and n-gram self-speculative decoding reduce latency on generation-heavy workloads without changing model outputs. Plug in and go.
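As a rough illustration of the n-gram variant, the sketch below proposes draft tokens by finding an earlier occurrence of the most recent n tokens and copying what followed, then keeps only the draft prefix the target model agrees with. This is the general technique under simplifying assumptions (greedy decoding, token IDs as `u32`), not Forge's implementation, and the function names are hypothetical.

```rust
/// Propose up to `max_draft` tokens by matching the last `n` tokens of
/// `context` against the most recent earlier occurrence and copying
/// the tokens that followed it.
fn propose_ngram_draft(context: &[u32], n: usize, max_draft: usize) -> Vec<u32> {
    if n == 0 || context.len() <= n {
        return Vec::new();
    }
    let suffix = &context[context.len() - n..];
    // Scan from the most recent candidate back to the oldest,
    // excluding the suffix's own position.
    for start in (0..context.len() - n).rev() {
        if &context[start..start + n] == suffix {
            let follow = start + n;
            let end = (follow + max_draft).min(context.len());
            return context[follow..end].to_vec();
        }
    }
    Vec::new()
}

/// Number of draft tokens to keep: the longest prefix on which the draft
/// matches the tokens the target model itself would have chosen.
fn accepted_prefix_len(draft: &[u32], target: &[u32]) -> usize {
    draft.iter().zip(target).take_while(|(d, t)| d == t).count()
}
```

Because only tokens the target model confirms are kept, the emitted sequence is the same one the model would have produced on its own. The saving comes from verifying several draft tokens in a single forward pass.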

Constrained JSON Decoding

A pushdown automaton guarantees valid JSON output on every request. Structured data extraction and tool-call schemas work reliably at inference time.
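To show why a stack is needed, here is a deliberately reduced sketch that tracks only bracket nesting and string state. It does not cover the full JSON grammar (commas, colons, literals, numbers), and the type and function names are hypothetical rather than taken from Forge.

```rust
/// Minimal pushdown state: a stack of open containers plus string tracking.
#[derive(Default)]
struct JsonNesting {
    stack: Vec<char>, // '{' or '[' for each container still open
    in_string: bool,
    escaped: bool,
}

impl JsonNesting {
    /// Would emitting `c` next keep the bracket structure valid?
    fn allows(&self, c: char) -> bool {
        if self.in_string {
            return true; // string contents never affect nesting
        }
        match c {
            '}' => self.stack.last() == Some(&'{'),
            ']' => self.stack.last() == Some(&'['),
            _ => true,
        }
    }

    /// Advance the state after `c` has been emitted.
    fn push(&mut self, c: char) {
        if self.in_string {
            if self.escaped {
                self.escaped = false;
            } else if c == '\\' {
                self.escaped = true;
            } else if c == '"' {
                self.in_string = false;
            }
            return;
        }
        match c {
            '"' => self.in_string = true,
            '{' | '[' => self.stack.push(c),
            '}' | ']' => {
                self.stack.pop();
            }
            _ => {}
        }
    }

    /// True once every container and string has been closed.
    fn is_balanced(&self) -> bool {
        self.stack.is_empty() && !self.in_string
    }
}

/// Feed `text` through the automaton, rejecting at the first disallowed character.
fn nesting_is_valid(text: &str) -> bool {
    let mut state = JsonNesting::default();
    for c in text.chars() {
        if !state.allows(c) {
            return false;
        }
        state.push(c);
    }
    state.is_balanced()
}
```

In a constrained decoder, a check like `allows` would be applied before sampling to mask out tokens that would break the structure, so invalid output is never generated instead of being rejected afterwards.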

Supervised Engine Recovery

A supervisor thread watches the dedicated CUDA inference thread and restarts it automatically after a panic, reloading weights and resuming requests without operator intervention. A watchdog flags genuine GPU stalls through the health endpoint so load balancers can route around an unhealthy node.
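The restart-on-panic pattern can be sketched with the standard library alone, since `JoinHandle::join` returns `Err` when the joined thread panicked. The `supervise` function and its restart limit are hypothetical, and the sketch omits the weight reloading and request resumption described above.

```rust
use std::thread;

/// Run `worker` on its own thread, restarting it after a panic.
/// Returns the attempt number on which it exited cleanly,
/// or `None` if `max_restarts` restarts were not enough.
fn supervise<F>(worker: F, max_restarts: u32) -> Option<u32>
where
    F: Fn(u32) + Send + Clone + 'static,
{
    for attempt in 0..=max_restarts {
        let w = worker.clone();
        let handle = thread::spawn(move || w(attempt));
        // join() yields Err if the worker thread panicked.
        match handle.join() {
            Ok(()) => return Some(attempt),
            Err(_) => {
                // A real supervisor would reload state here before retrying.
            }
        }
    }
    None
}
```

Isolating the worker on a dedicated thread is what makes this possible: the panic unwinds only that thread, and the supervisor observes it as an ordinary `Result`.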

Prometheus Metrics Built In

A /metrics endpoint exposes latency histograms, per-error-class counters, queue depth and slot utilisation in standard Prometheus text format. Wire it straight into Grafana or any compatible scraper.
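For readers unfamiliar with the Prometheus text format, the sketch below renders a gauge and a labelled counter family the way a scraper expects them. The metric names (`forge_queue_depth`, `forge_errors_total`) and the `class` label are illustrative assumptions, not Forge's actual metric names.

```rust
use std::fmt::Write;

/// Render one gauge and one labelled counter family in
/// Prometheus text exposition format.
fn render_metrics(queue_depth: u64, errors_by_class: &[(&str, u64)]) -> String {
    let mut out = String::new();
    out.push_str("# HELP forge_queue_depth Requests waiting for a slot.\n");
    out.push_str("# TYPE forge_queue_depth gauge\n");
    writeln!(out, "forge_queue_depth {queue_depth}").unwrap();
    out.push_str("# HELP forge_errors_total Errors by class.\n");
    out.push_str("# TYPE forge_errors_total counter\n");
    for (class, count) in errors_by_class {
        // Doubled braces emit literal '{' and '}' around the label set.
        writeln!(out, "forge_errors_total{{class=\"{class}\"}} {count}").unwrap();
    }
    out
}
```

Each metric family gets a `# HELP` and `# TYPE` line followed by one sample per label combination, which is why per-error-class counters fit naturally as a single counter with a label.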

Single Rust Binary

No Python runtime. No PyTorch. One statically-linked binary you deploy anywhere CUDA is present. OpenAI and Anthropic compatible HTTP API out of the box.

12+ Model Families
9+ Quantisation Formats
39 CUDA Kernels
0 Python Dependencies

How Forge Fits the Stack

Forge is a single-GPU inference engine. Other TensorFoundry products orchestrate, route, or build on top of it.

Forge vs FoundryOS

Backend vs Orchestrator

Forge is the engine. FoundryOS orchestrates fleets of engines across nodes. Run Forge on a single H100/H200/B200/GB200; run FoundryOS when you need distributed inference at scale.

Forge vs Alloy

Inference vs Gateway

Forge runs the model locally. Alloy is a gateway that routes to external LLM providers and manages API keys, budgets and access policies across your team.

Forge vs Olla

Inference vs Proxy

Olla is an open-source proxy that routes across multiple backends. Forge is the backend itself. Use Olla in front of Forge to add intelligent routing and health monitoring.

Designed for NVIDIA

Forge targets the NVIDIA CUDA ecosystem. Built and validated on RTX PRO 6000, H100, H200, B200, and GB200.

RTX 4090: Developer workstation
RTX 5090: High-end AI workstation
RTX 6000 Ada: Professional AI workstation
RTX PRO 6000 Blackwell: Next-gen workstation AI
H100: Datacentre AI accelerator
H200: Enterprise AI infra
B200: Hyperscale AI accelerator
GB200 NVL72: Frontier AI supercluster

Frequently Asked Questions

What is Forge?

Forge is a pure-Rust, CUDA-native LLM inference engine. It ships as a single statically-linked binary with no Python runtime and no PyTorch dependency. Forge exposes an OpenAI and Anthropic compatible HTTP API, supports 12+ model architecture families and 9+ quantisation formats, and includes 39 hand-tuned CUDA kernels for maximum throughput on NVIDIA hardware.

Who is Forge designed for?

Forge is designed for teams that need maximum inference performance on NVIDIA GPUs and want to eliminate Python runtime overhead. It suits AI infrastructure engineers who need predictable latency, high GPU utilisation and a single binary they can deploy anywhere CUDA is present, without managing a Python environment or PyTorch installation.

When will Forge be available?

Forge is targeting its Early Access Program in Q3 2026. Join the waitlist to be first in line and to provide feedback during the preview phase.

What model architectures and quantisation formats does Forge support?

Forge supports 12+ model families including Llama, Qwen, Mistral, Gemma 1 through 4, Phi, DeepSeek, GLM-4 and GPT-OSS. Quantisation formats include BF16, FP8, INT4/GPTQ, INT4/AWQ, NVFP4, MXFP4 and TurboQuant4 (TensorFoundry's own Lloyd-Max codebook format with per-group Hadamard rotation). Native FP8 GEMM via cublasLt is available on Hopper (H100/H200) and Blackwell (B200) hardware.

How does Forge fit with the rest of the TensorFoundry stack?

Forge is the inference engine. Olla, TensorFoundry's open-source LLM proxy, can sit in front of Forge to add intelligent routing, health monitoring and failover across multiple Forge nodes. FoundryOS can orchestrate a fleet of Forge instances across a GPU cluster, providing Scout monitoring and Fleet-managed configuration distribution. Alloy can sit above all of these as the enterprise API gateway for controlling access and budgets.

How do I get early access to Forge?

Join the Forge preview waitlist to register your interest for the Q3 2026 Early Access Program.

Get Preview Access to Forge

Forge is in active development. Join the waitlist to be first in line for EAP Q3 2026.