Forge
CUDA-native LLM inference powered by the Inferno engine. A single Rust binary replaces Python, PyTorch and runtime complexity. 13+ model families, 8 quantisation formats, 25 hand-tuned CUDA kernels. OpenAI and Anthropic API compatible from the start.
Built from the ground up in Rust for NVIDIA hardware. No abstraction tax. Every microsecond is accounted for.
13+ Model Families
Llama, Qwen, Mistral, Gemma, Phi, DeepSeek and more. Broad coverage of the open-weight ecosystem with ongoing additions as new architectures emerge.
BF16, FP8, INT4/GPTQ, INT4/AWQ, NVFP4, MXFP4 and beyond. Native FP8 GEMM via cuBLASLt delivers best-in-class throughput on H100 and Blackwell hardware.
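For a sense of what the INT4/GPTQ-style formats store, here is a minimal CPU-side sketch of group-wise dequantisation, assuming the common layout of two 4-bit weights per byte with one scale and one zero point per group. It is an illustration of the format only, not Forge's kernels, which do this work on the GPU fused into the GEMM.

```rust
/// Sketch of group-wise INT4 dequantisation (GPTQ/AWQ-style layout, assumed).
/// Two 4-bit weights are packed per byte; each group of `group_size` weights
/// shares one scale and one zero point.
fn dequantize_int4(packed: &[u8], scales: &[f32], zeros: &[f32], group_size: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for byte in packed {
        // Low nibble first, then high nibble.
        for q in [byte & 0x0F, byte >> 4] {
            let group = out.len() / group_size;
            out.push(scales[group] * (q as f32 - zeros[group]));
        }
    }
    out
}

fn main() {
    // Two bytes = four weights (1, 2, 3, 4) in one group with scale 0.5, zero 8.
    let packed = [0x21u8, 0x43];
    let w = dequantize_int4(&packed, &[0.5], &[8.0], 4);
    println!("{w:?}"); // [-3.5, -3.0, -2.5, -2.0]
}
```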
Flash-attention prefill, paged KV cache and sliding window attention combine to deliver maximum throughput with minimal memory overhead on modern NVIDIA hardware.
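The paged KV cache is the piece that keeps memory overhead low: logical token positions map through a small per-sequence block table into fixed-size physical blocks, so KV memory is allocated on demand and never has to be contiguous. A concept-level sketch of that bookkeeping, not Forge's implementation:

```rust
/// Concept-level sketch of a paged KV cache block table. A token's KV entry
/// lives at (block_table[pos / BLOCK_SIZE], pos % BLOCK_SIZE).
const BLOCK_SIZE: usize = 16;

struct BlockAllocator {
    free: Vec<usize>, // pool of free physical block ids
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).rev().collect() }
    }
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }
}

struct BlockTable {
    blocks: Vec<usize>, // physical block id for each logical block of this sequence
}

impl BlockTable {
    /// Translate a logical token position into (physical block, offset),
    /// growing the table whenever a new block is needed.
    fn slot_for(&mut self, pos: usize, alloc: &mut BlockAllocator) -> Option<(usize, usize)> {
        let logical = pos / BLOCK_SIZE;
        while self.blocks.len() <= logical {
            self.blocks.push(alloc.alloc()?);
        }
        Some((self.blocks[logical], pos % BLOCK_SIZE))
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(1024);
    let mut table = BlockTable { blocks: Vec::new() };
    // Token 40 lands in the sequence's third block (40 / 16 = 2), slot 8.
    let (block, offset) = table.slot_for(40, &mut alloc).unwrap();
    println!("token 40 -> physical block {block}, offset {offset}");
}
```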
EAGLE-3 and n-gram self-speculative decoding reduce latency on generation-heavy workloads without changing model outputs. Plug in and go.
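N-gram self-speculation works by matching the trailing n-gram of the sequence against earlier context and proposing the tokens that followed it as a draft, which the model then verifies in a single forward pass; only verified tokens are kept, which is why outputs are unchanged. A conceptual sketch of the lookup step, not Forge's code:

```rust
/// Conceptual sketch of n-gram draft proposal (prompt-lookup style).
/// Find an earlier occurrence of the trailing `n`-gram and propose the `k`
/// tokens that followed it as the draft for the model to verify.
fn propose_ngram_draft(tokens: &[u32], n: usize, k: usize) -> Vec<u32> {
    if tokens.len() <= n {
        return Vec::new();
    }
    let tail = &tokens[tokens.len() - n..];
    // Scan right to left so the most recent earlier match wins.
    for start in (0..tokens.len() - n).rev() {
        if &tokens[start..start + n] == tail {
            let from = start + n;
            let to = (from + k).min(tokens.len());
            return tokens[from..to].to_vec();
        }
    }
    Vec::new() // no match: fall back to ordinary one-token-at-a-time decoding
}

fn main() {
    // The trailing 3-gram [11, 12, 13] also appears near the start, so the
    // tokens that followed it become the draft: [14, 15, 16, 10].
    let tokens = [10, 11, 12, 13, 14, 15, 16, 10, 11, 12, 13];
    println!("{:?}", propose_ngram_draft(&tokens, 3, 4));
}
```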
A pushdown automaton guarantees valid JSON output on every request. Structured data extraction and tool-call schemas work reliably at inference time.
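A hypothetical structured-output request, assuming Forge follows the OpenAI convention of a `/v1/chat/completions` route with a `json_schema` response format; the port, model name and schema below are placeholders, not confirmed details.

```rust
// Cargo deps (assumed): reqwest = { version = "0.12", features = ["blocking", "json"] }, serde_json = "1"
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "llama-3.1-8b-instruct", // placeholder model name
        "messages": [
            { "role": "user", "content": "Extract the city and country from: 'I flew into Lyon, France.'" }
        ],
        // OpenAI-style structured outputs: constrain generation to a JSON schema.
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "location",
                "schema": {
                    "type": "object",
                    "properties": {
                        "city":    { "type": "string" },
                        "country": { "type": "string" }
                    },
                    "required": ["city", "country"]
                }
            }
        }
    });

    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:8000/v1/chat/completions") // placeholder local endpoint
        .json(&body)
        .send()?
        .text()?;
    println!("{resp}");
    Ok(())
}
```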
No Python runtime. No PyTorch. One statically linked binary you deploy anywhere CUDA is present. OpenAI- and Anthropic-compatible HTTP APIs out of the box.
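And a hypothetical Anthropic-style call against the same local endpoint; the route, headers, port and model name follow the shape of the Anthropic Messages API and are assumptions, not confirmed Forge details.

```rust
// Cargo deps (assumed): reqwest = { version = "0.12", features = ["blocking", "json"] }, serde_json = "1"
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "qwen2.5-32b-instruct", // placeholder model name
        "max_tokens": 256,
        "messages": [
            { "role": "user", "content": "Summarise the benefits of paged KV caches in two sentences." }
        ]
    });

    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:8000/v1/messages") // placeholder local endpoint
        .header("anthropic-version", "2023-06-01") // header from the Anthropic API shape
        .json(&body)
        .send()?
        .text()?;
    println!("{resp}");
    Ok(())
}
```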
Forge is a single-GPU inference engine. Other TensorFoundry products orchestrate, route, or build on top of it.
Forge is the engine. FoundryOS orchestrates fleets of engines across nodes. Run Forge on a single H100; run FoundryOS when you need distributed inference at scale.
Forge runs the model locally. Alloy is a gateway that routes to external LLM providers and manages API keys, budgets and access policies across your team.
Olla is an open-source proxy that routes across multiple backends. Forge is the backend itself. Use Olla in front of Forge to add intelligent routing and health monitoring.
Forge targets the NVIDIA CUDA ecosystem. Built and validated on RTX PRO 6000 and H100.
Forge is in active development. Join the waitlist to be first in line for EAP Q3 2026.