EAP Q3 2026
Forge

Pure Rust. Pure Speed.

CUDA-native LLM inference powered by the Inferno engine. A single Rust binary replaces Python, PyTorch and runtime complexity. 13+ model families, 8 quantisation formats, 25 hand-tuned CUDA kernels. Compatible with the OpenAI and Anthropic APIs from the start.

What Forge Delivers

Built from the ground up in Rust for NVIDIA hardware. No abstraction tax. Every microsecond is accounted for.

13+ Model Families

Llama, Qwen, Mistral, Gemma, Phi, DeepSeek and more. Broad coverage of the open-weight ecosystem with ongoing additions as new architectures emerge.

8 Quantisation Formats

BF16, FP8, INT4/GPTQ, INT4/AWQ, NVFP4, MXFP4 and beyond. Native FP8 GEMM via cuBLASLt for H100 and Blackwell hardware delivers best-in-class throughput.
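The INT4 formats fit two weights into every byte. A minimal sketch of the packing, with nibble order chosen for illustration; real GPTQ/AWQ layouts add per-group scales and zero-points that are omitted here:

```rust
/// Packs two unsigned 4-bit values into one byte, low nibble first.
/// This shows only the bit packing that halves memory versus INT8;
/// dequantisation with scales and zero-points happens in the kernels.
fn pack_int4(lo: u8, hi: u8) -> u8 {
    debug_assert!(lo < 16 && hi < 16);
    (hi << 4) | lo
}

/// Recovers the two 4-bit values from a packed byte.
fn unpack_int4(b: u8) -> (u8, u8) {
    (b & 0x0F, b >> 4)
}

fn main() {
    let packed = pack_int4(3, 12);
    assert_eq!(unpack_int4(packed), (3, 12));
    println!("int4 round-trip ok");
}
```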

Flash Attention & Paged KV

Flash-attention prefill, paged KV cache and sliding window attention combine to deliver maximum throughput with minimal memory overhead on modern NVIDIA hardware.
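The paging idea reduces to simple address arithmetic: the cache lives in fixed-size pages rather than one contiguous slab per sequence, so memory is allocated on demand. A sketch under assumed names and an assumed page size of 16 tokens (not Forge internals):

```rust
/// Tokens per KV page; real engines pick this to match kernel tile sizes.
const PAGE_SIZE: usize = 16;

/// Maps a logical token position to (page index, offset within page),
/// so sequences can grow page by page instead of reserving worst-case
/// contiguous memory up front.
fn kv_slot(pos: usize) -> (usize, usize) {
    (pos / PAGE_SIZE, pos % PAGE_SIZE)
}

fn main() {
    assert_eq!(kv_slot(0), (0, 0));   // first token opens page 0
    assert_eq!(kv_slot(16), (1, 0));  // token 16 starts page 1
    assert_eq!(kv_slot(37), (2, 5));  // 37 = 2 * 16 + 5
    println!("paged addressing ok");
}
```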

Speculative Decoding

EAGLE-3 and n-gram self-speculative decoding reduce latency on generation-heavy workloads without changing model outputs. Plug in and go.
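The n-gram variant needs no separate draft model: it proposes tokens by matching the current suffix against earlier output and replaying what followed. A rough sketch; the function name, n-gram length and draft length are illustrative, not Forge's configuration:

```rust
/// Drafts up to `k` tokens by finding the most recent earlier occurrence
/// of the current `n`-token suffix and replaying its continuation. The
/// full model then verifies the draft, so final outputs are unchanged.
fn draft_ngram(history: &[u32], n: usize, k: usize) -> Vec<u32> {
    if history.len() < n {
        return Vec::new();
    }
    let suffix = &history[history.len() - n..];
    // Scan from the most recent candidate position backwards.
    for start in (0..history.len() - n).rev() {
        if &history[start..start + n] == suffix {
            let cont = start + n;
            let end = (cont + k).min(history.len());
            return history[cont..end].to_vec();
        }
    }
    Vec::new() // no match: fall back to normal decoding
}

fn main() {
    // The suffix [1, 2] occurred earlier, followed by [3, 1]: draft those.
    assert_eq!(draft_ngram(&[1, 2, 3, 1, 2], 2, 2), vec![3, 1]);
    println!("ngram draft ok");
}
```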

Constrained JSON Decoding

A pushdown automaton constrains token selection so every constrained response is syntactically valid JSON. Structured data extraction and tool-call schemas work reliably at inference time.
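The core of the approach is a stack that tracks open JSON containers and rejects anything that would close them out of order. A toy version over single characters; Forge's real automaton covers the full JSON grammar, including strings, escapes and schema constraints:

```rust
/// Toy pushdown check: accepts only inputs whose braces and brackets
/// nest correctly. A constrained decoder applies the same discipline
/// per token, masking logits that would make the automaton reject.
fn nests_ok(s: &str) -> bool {
    let mut stack = Vec::new();
    for c in s.chars() {
        match c {
            '{' | '[' => stack.push(c),
            '}' => {
                if stack.pop() != Some('{') {
                    return false;
                }
            }
            ']' => {
                if stack.pop() != Some('[') {
                    return false;
                }
            }
            _ => {} // a real grammar also tracks strings, commas, colons
        }
    }
    stack.is_empty()
}

fn main() {
    assert!(nests_ok(r#"{"items": [1, 2]}"#));
    assert!(!nests_ok(r#"{"items": [1, 2}"#));
    println!("pda check ok");
}
```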

Single Rust Binary

No Python runtime. No PyTorch. One statically linked binary you deploy anywhere CUDA is present. An OpenAI- and Anthropic-compatible HTTP API out of the box.
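Clients already built for the OpenAI API should work unchanged. The sketch below just assembles the familiar chat-completions request body; the model name is a placeholder, not a confirmed Forge default:

```rust
/// Builds an OpenAI-style chat-completions request body. Pointing any
/// existing OpenAI client at Forge's HTTP endpoint sends this shape;
/// no Forge-specific SDK is required.
fn chat_body(model: &str, prompt: &str) -> String {
    // Hand-rolled JSON keeps the sketch dependency-free; a real client
    // would use a serialiser and escape `prompt` properly.
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}]}}"#,
        model, prompt
    )
}

fn main() {
    let body = chat_body("llama-3-8b", "Hello");
    assert!(body.contains(r#""role":"user""#));
    println!("{}", body);
}
```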

13+ Model Families
8 Quantisation Formats
25 CUDA Kernels
0 Python Dependencies

How Forge Fits the Stack

Forge is a single-GPU inference engine. Other TensorFoundry products orchestrate, route, or build on top of it.


Forge vs FoundryOS

Backend vs Orchestrator

Forge is the engine. FoundryOS orchestrates fleets of engines across nodes. Run Forge on a single H100; run FoundryOS when you need distributed inference at scale.


Forge vs Alloy

Inference vs Gateway

Forge runs the model locally. Alloy is a gateway that routes to external LLM providers and manages API keys, budgets and access policies across your team.


Forge vs Olla

Inference vs Proxy

Olla is an open-source proxy that routes across multiple backends. Forge is the backend itself. Use Olla in front of Forge to add intelligent routing and health monitoring.

Designed for NVIDIA

Forge targets the NVIDIA CUDA ecosystem. Built and validated on RTX PRO 6000 and H100.

RTX PRO 6000 (Blackwell workstation)
H100 SXM (data centre flagship)
H100 PCIe (data centre)
RTX 4090 (consumer flagship)

Get Preview Access to Forge

Forge is in active development. Join the waitlist to be first in line for EAP Q3 2026.