Forge
CUDA-native LLM inference powered by the Inferno engine, shipped as a single binary. 12+ model families, 9+ quantisation formats, 39 hand-tuned CUDA kernels. Native OpenAI and Anthropic API compatibility and extensive customisation.
Built from the ground up in Rust for NVIDIA hardware. No abstraction tax. Every microsecond is accounted for.
Llama, Qwen, Mistral, Gemma 1 through 4, Phi, DeepSeek, GLM-4 and GPT-OSS. Dense, mixture-of-experts and sliding-window variants are all covered, with ongoing additions as new architectures emerge.
BF16, FP8, INT4/GPTQ, INT4/AWQ, NVFP4, MXFP4 and TurboQuant4, our own Lloyd-Max codebook format with per-group Hadamard rotation. Native FP8 GEMM via cublasLt on Hopper (H100/H200) and Blackwell (B200) hardware delivers best-in-class throughput.
Flash-attention prefill, a paged KV cache supporting up to 128K tokens of context, and sliding-window attention combine to deliver maximum throughput with minimal memory overhead. Storing the paged KV cache in FP8 halves cache memory relative to BF16, roughly doubling block capacity on the same GPU.
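The halving claim can be checked with simple arithmetic. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128), the 16-token block size and the 16 GiB budget are assumptions chosen for the example, not Forge's actual configuration, and any per-block scaling metadata an FP8 cache may carry is ignored.

```rust
/// Bytes of KV cache needed per token: one K and one V vector per layer.
/// All dimensions are caller-supplied; nothing here is read from Forge.
fn kv_bytes_per_token(layers: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: u64) -> u64 {
    2 * layers * kv_heads * head_dim * bytes_per_elem
}

/// Number of fixed-size cache blocks that fit in a memory budget.
/// `block_tokens` is a hypothetical block size, in tokens.
fn blocks_in_budget(budget_bytes: u64, block_tokens: u64, bytes_per_token: u64) -> u64 {
    budget_bytes / (block_tokens * bytes_per_token)
}
```

With these assumed dimensions, a token costs 128 KiB in BF16 (2 bytes per element) and 64 KiB in FP8 (1 byte per element), so the same 16 GiB budget holds 8,192 blocks in BF16 and 16,384 in FP8. A single 128K-token sequence would need 16 GiB in BF16 and 8 GiB in FP8.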
EAGLE-3 and n-gram self-speculative decoding reduce latency on generation-heavy workloads without changing model outputs. Plug in and go.
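As a rough illustration of the n-gram idea, the sketch below proposes draft tokens by finding the most recent earlier occurrence of the last `n` tokens and copying what followed, then keeps only the prefix that the target model itself would have produced. The function names are hypothetical and this is a generic greedy-decoding sketch, not Forge's implementation; the verification step is what leaves outputs unchanged.

```rust
/// Propose up to `max_draft` tokens by matching the last `n` context tokens
/// against earlier context and copying the tokens that followed the match.
fn propose_ngram_draft(context: &[u32], n: usize, max_draft: usize) -> Vec<u32> {
    if n == 0 || context.len() <= n {
        return Vec::new();
    }
    let suffix = &context[context.len() - n..];
    // Scan from the most recent candidate backwards, excluding the suffix itself.
    for start in (0..context.len() - n).rev() {
        if &context[start..start + n] == suffix {
            let from = start + n;
            let to = (from + max_draft).min(context.len());
            return context[from..to].to_vec();
        }
    }
    Vec::new()
}

/// Count how many leading draft tokens agree with the target model's own
/// greedy choices (`target`); only that prefix is accepted.
fn accept_prefix(draft: &[u32], target: &[u32]) -> usize {
    draft.iter().zip(target).take_while(|(d, t)| d == t).count()
}
```

For the context `[1, 2, 3, 4, 1, 2]` with `n = 2`, the draft is `[3, 4, 1]`; if the target model's choices for those positions are `[3, 4, 9]`, two tokens are accepted and the third is replaced by the model's own token.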
A pushdown automaton guarantees valid JSON output on every request. Structured data extraction and tool-call schemas work reliably at inference time.
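To show why a stack is needed here, the following is a deliberately reduced sketch of a pushdown automaton that tracks only bracket nesting and string boundaries. It does not validate the full JSON grammar (commas, colons, literals and number syntax are ignored), and the type and method names are hypothetical rather than taken from Forge. A constrained decoder would use a check like `step` to reject candidate tokens before they are emitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Frame {
    Object,
    Array,
}

/// Minimal pushdown automaton: a stack of open containers plus string state.
#[derive(Default)]
struct BracketPda {
    stack: Vec<Frame>,
    in_string: bool,
    escaped: bool,
}

impl BracketPda {
    /// Feed one character. Returns false, leaving the stack untouched,
    /// if the character would close a container that is not on top.
    fn step(&mut self, c: char) -> bool {
        if self.in_string {
            if self.escaped {
                self.escaped = false;
            } else if c == '\\' {
                self.escaped = true;
            } else if c == '"' {
                self.in_string = false;
            }
            return true;
        }
        match c {
            '"' => { self.in_string = true; true }
            '{' => { self.stack.push(Frame::Object); true }
            '[' => { self.stack.push(Frame::Array); true }
            '}' => self.close(Frame::Object),
            ']' => self.close(Frame::Array),
            _ => true,
        }
    }

    /// Pop the top frame only if it matches the closing bracket.
    fn close(&mut self, expected: Frame) -> bool {
        if self.stack.last() == Some(&expected) {
            self.stack.pop();
            true
        } else {
            false
        }
    }

    /// Accepting state: every container closed and no open string.
    fn is_complete(&self) -> bool {
        self.stack.is_empty() && !self.in_string
    }
}

/// Convenience wrapper: run the automaton over a whole string.
fn accepts(text: &str) -> bool {
    let mut pda = BracketPda::default();
    text.chars().all(|c| pda.step(c)) && pda.is_complete()
}
```

The stack is what distinguishes this from a finite-state check: a closing `}` is legal only if the most recently opened container is an object, which depends on arbitrarily deep nesting history.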
A supervisor thread watches the dedicated CUDA inference thread and restarts it automatically after a panic, reloading weights and resuming requests without operator intervention. A watchdog flags genuine GPU stalls through the health endpoint so load balancers can route around an unhealthy node.
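The restart pattern can be sketched with the Rust standard library alone: `JoinHandle::join` returns `Err` when the spawned thread panicked, which lets a supervisor loop decide whether to start a fresh worker. The `supervise` function, its `max_restarts` cap and the `attempt` argument are hypothetical names for illustration; the real engine would also reload weights and re-queue requests at the point marked in the comments.

```rust
use std::thread;

/// Run `work` on a dedicated thread and restart it after a panic,
/// up to `max_restarts` times. On success, returns how many restarts were needed.
/// `work` receives the attempt number (0 for the first run).
fn supervise<F>(work: F, max_restarts: u32) -> Result<u32, String>
where
    F: Fn(u32) + Send + Clone + 'static,
{
    let mut restarts = 0;
    loop {
        let job = work.clone();
        let attempt = restarts;
        // A real worker would (re)load weights here before serving requests.
        let handle = thread::spawn(move || job(attempt));
        match handle.join() {
            // Worker returned normally: nothing left to supervise.
            Ok(()) => return Ok(restarts),
            // Worker panicked and we still have restart budget.
            Err(_) if restarts < max_restarts => restarts += 1,
            // Worker panicked and the budget is exhausted.
            Err(_) => {
                return Err(format!("worker still panicking after {} restarts", restarts));
            }
        }
    }
}
```

Joining blocks the supervisor until the worker exits, so this sketch covers panics only. Detecting a stalled but still-running GPU thread needs a separate heartbeat or timeout, which is the role the watchdog plays in the description above.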
A /metrics endpoint exposes latency histograms, per-error-class counters, queue depth and slot utilisation in standard Prometheus text format. Wire it straight into Grafana or any compatible scraper.
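For readers unfamiliar with the Prometheus text format, the sketch below renders one metric family the way a scraper expects it: a `# HELP` line, a `# TYPE` line, then one sample per line with optional labels in braces. The metric names used in the usage note (`forge_queue_depth`, `forge_errors_total`) are invented for illustration and are not Forge's documented metric names.

```rust
use std::fmt::Write;

/// Render one metric family in Prometheus text exposition format.
/// Each sample is a (labels, value) pair; `labels` is a pre-formatted
/// string such as `class="timeout"`, or empty for an unlabelled sample.
fn render_family(name: &str, kind: &str, help: &str, samples: &[(&str, f64)]) -> String {
    let mut out = String::new();
    // Writing to a String cannot fail, so unwrap is safe here.
    writeln!(out, "# HELP {} {}", name, help).unwrap();
    writeln!(out, "# TYPE {} {}", name, kind).unwrap();
    for (labels, value) in samples {
        if labels.is_empty() {
            writeln!(out, "{} {}", name, value).unwrap();
        } else {
            // `{{` and `}}` emit literal braces around the label set.
            writeln!(out, "{}{{{}}} {}", name, labels, value).unwrap();
        }
    }
    out
}
```

Calling `render_family("forge_errors_total", "counter", "Errors by class.", &[("class=\"timeout\"", 2.0)])` yields a sample line of the form `forge_errors_total{class="timeout"} 2`, which Prometheus and Grafana-compatible scrapers parse directly.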
No Python runtime. No PyTorch. One statically-linked binary you deploy anywhere CUDA is present. OpenAI and Anthropic compatible HTTP API out of the box.
Forge is a single-GPU inference engine. Other TensorFoundry products orchestrate, route, or build on top of it.
Forge is the engine. FoundryOS orchestrates fleets of engines across nodes. Run Forge on a single H100/H200/B200/GB200; run FoundryOS when you need distributed inference at scale.
Forge runs the model locally. Alloy is a gateway that routes to external LLM providers and manages API keys, budgets and access policies across your team.
Olla is an open-source proxy that routes across multiple backends. Forge is the backend itself. Use Olla in front of Forge to add intelligent routing and health monitoring.
Forge targets the NVIDIA CUDA ecosystem. Built and validated on RTX PRO 6000, H100, H200, B200, and GB200.
Forge is a pure-Rust, CUDA-native LLM inference engine. It ships as a single statically-linked binary with no Python runtime and no PyTorch dependency. Forge exposes an OpenAI- and Anthropic-compatible HTTP API, supports 12+ model architecture families and 9+ quantisation formats, and includes 39 hand-tuned CUDA kernels for maximum throughput on NVIDIA hardware.
Forge is designed for teams that need maximum inference performance on NVIDIA GPUs and want to eliminate Python runtime overhead. It suits AI infrastructure engineers who need predictable latency and GPU utilisation, and who want a single binary they can deploy anywhere CUDA is present without managing a Python environment or PyTorch installation.
Forge is targeting its Early Access Program in Q3 2026. Join the waitlist to be first in line and to provide feedback during the preview phase.
Forge supports 12+ model families including Llama, Qwen, Mistral, Gemma 1 through 4, Phi, DeepSeek, GLM-4 and GPT-OSS. Quantisation formats include BF16, FP8, INT4/GPTQ, INT4/AWQ, NVFP4, MXFP4 and TurboQuant4 (TensorFoundry's own Lloyd-Max codebook format with per-group Hadamard rotation). Native FP8 GEMM via cublasLt is available on Hopper (H100/H200) and Blackwell (B200) hardware.
Forge is the inference engine. Olla, TensorFoundry's open-source LLM proxy, can sit in front of Forge to add intelligent routing, health monitoring and failover across multiple Forge nodes. FoundryOS can orchestrate a fleet of Forge instances across a GPU cluster, providing Scout monitoring and Fleet-managed configuration distribution. Alloy can sit above all of these as the enterprise API gateway for controlling access and budgets.
Join the Forge preview waitlist to register your interest for the Q3 2026 Early Access Program.
Forge is in active development. Join the waitlist to be first in line for the Early Access Program (EAP) in Q3 2026.