Back to Blog

Deploying LLMs on Your Own Infrastructure - A Practical Guide

Running LLMs on your own infrastructure#

Running large language models on your own infrastructure is now a practical default for many teams, not an exotic choice reserved for companies with dedicated ML platform teams. The tooling has matured, the hardware options have widened, and the case for doing it has strengthened as cloud API costs compound at scale, privacy requirements tighten, and vendor dependencies become liabilities. This guide covers the whole picture: why teams choose to self-host, how the stack is built layer by layer, what hardware decisions actually hinge on, and how to decide when to stop adding complexity and when to grow into more of it.

Why self-host?#

The decision to run inference on your own infrastructure is usually driven by one or more of four things: cost, data control, latency, or independence.

Cost at scale#

Cloud APIs are priced per token. That is entirely sensible when you are prototyping or running low volumes - you pay for what you use and avoid provisioning anything. The trade-off shifts as call volume grows. Once a workload is steady and predictable, the per-token cost often exceeds the amortised cost of owning or leasing GPU capacity. The crossover point depends on model size, quantisation, hardware cost, and utilisation, but for many production workloads it arrives earlier than teams expect. The cost comparison between self-hosted and cloud API inference covers this in detail.

Data privacy and sovereignty#

This is often the reason that removes optionality entirely. If your data is subject to the Australian Privacy Act, GDPR, or sector-specific regulation (healthcare, finance, legal), sending it to a third-party API may not be acceptable - or may require specific contractual and technical controls that are difficult to enforce in practice. Self-hosting keeps inference on infrastructure you control. Your prompts, completions, and whatever context you are injecting into them never leave your network.

For Australian businesses handling personal information, the Privacy Act 1988 is specific about overseas disclosure: the disclosing entity remains accountable for what happens to that data. Data residency - the requirement that certain data not leave a particular jurisdiction - is a direct argument for on-premise or private-cloud inference.

Latency and control#

Cloud API latency includes the round-trip to the provider's servers and is subject to their load, rate limits, and quota policies. On your own hardware, the model is a few milliseconds away. You control the hardware, the software stack, the model weights, and the concurrency limits. When you have a latency-sensitive workload - a real-time coding assistant, an interactive document editor, anything where the user is waiting - that control matters.

No vendor lock-in, no metering anxiety#

Tight coupling to a single cloud provider's API means their pricing changes, their outages, and their deprecation cycles all become your problem. Self-hosting trades that external dependency for an operational one - you own the uptime responsibility - but the operational dependency is one you can engineer around, monitor directly, and invest in proportionally.


The stack, layer by layer#

Self-hosted LLM infrastructure is not a single tool. It is a set of layers, each solving a different problem, each earning its place when the problem in front of you actually needs it. Most teams only need the first two layers for a long time.

The four layers of a self-hosted inference stack

Backends

Run the models. Accepts requests, loads weights, runs inference, returns tokens. One backend per machine or process. Examples: vLLM, llama.cpp, SGLang, Ollama & LM Studio.

Proxy

One stable endpoint. Routes requests across backends, health-checks each one, handles failover, and optionally manages sticky sessions for KV-cache affinity. Add this when you have two or more backends.

Gateway

Multi-tenant control. Authentication, API keys per team or customer, rate limiting, budget enforcement, audit logging. Add this when multiple teams or external customers share the same inference infrastructure.

Platform

Fleet orchestration. Scheduling, autoscaling, fleet health, node management across many inference machines. Add this when you are running more backends than you can manage by hand.

Here is how the layers relate, and what each one adds:

LayerWhat it solvesWhen you need it
BackendRuns the model; serves inference requestsAlways - this is the foundation
ProxyOne endpoint, routing, failover, KV-cache affinityAs soon as you have 2+ backends
GatewayPer-tenant keys, rate limits, budgets, audit trailsWhen multiple teams or customers share infrastructure
PlatformFleet scheduling, autoscaling, fleet-wide observabilityWhen backends outnumber what you can operate by hand
BackendsProxy (Olla)Gateway (Alloy)Platform (FoundryOS) add routing and failoveradd tenant controlsadd fleet orchestration

Backends: where inference happens#

The backend is the process that actually loads model weights and runs inference. Several open-source options have become the standard choices:

vLLM is the production-grade, high-throughput option. It uses PagedAttention for efficient KV cache management and supports continuous batching, which means it can handle many concurrent requests without the throughput collapse you get from naive batching. It is Python-based and designed for server deployments. If you are running on data centre-grade GPU hardware and need to saturate it efficiently, vLLM is the starting point.

SGLang is a newer inference framework with strong performance on structured generation workloads and multi-step reasoning chains. It is worth evaluating alongside vLLM for workloads that use structured output or complex agentic prompting patterns.

llama.cpp is the option that runs everywhere. Written in C++, it runs efficiently on CPU, on Apple Silicon via Metal, and on consumer GPUs. Its GGUF quantisation formats make it possible to run genuinely capable models on hardware that could never load the full-precision weights. It is the right choice for edge deployments, developer machines, and anywhere you need to run inference without a data centre GPU.

Ollama wraps llama.cpp (and other backends) in a package that is straightforward to install, ships with model management built in, and exposes an OpenAI-compatible API. It is the fastest way to get a local model running on macOS, Linux, or Windows. For development and team tooling - powering a shared coding assistant, a local chat interface - it is often the right default.

LM Studio provides a desktop GUI on top of llama.cpp and adds a local inference server mode. It is useful when the person running the model is not a developer and wants a visual interface to browse and manage models.

For a detailed comparison of these backends and guidance on which fits which use case, see our guide to LLM inference servers.

Proxy: one endpoint, routing, failover#

Once you have more than one backend - a GPU box in the office, an Ollama instance on a developer's machine, a Mac mini running llama.cpp - hardcoding your application to a single backend address becomes a liability. That backend goes down, restarts, or gets moved, and everything breaks. You also have no visibility into what is happening across the fleet.

An LLM proxy solves this by giving you one stable, OpenAI-compatible endpoint that your applications point to. The proxy knows about all your backends, health-checks each one continuously, and routes requests to a healthy backend automatically. When a backend fails, the proxy routes around it. When it recovers, it comes back into rotation.

Beyond basic failover, a good proxy handles:

  • Model-aware routing - sending requests for a given model to a backend that has it loaded
  • Priority-based routing - preferring the fast GPU backend and falling back to the CPU server only when needed
  • Sticky sessions - routing subsequent turns in a multi-turn conversation to the same backend so the KV cache stays warm
  • Per-backend observability - request counts, error rates, latency percentiles for each backend individually

Our primer on what an LLM proxy does explains the concept and when you actually need one.

Olla is our open-source LLM proxy and load balancer (Apache 2.0, Go, ~50MB RAM). It supports Ollama, LM Studio, vLLM, SGLang, llama.cpp, LiteLLM, LMDeploy, vLLM-MLX, Docker Model Runner, and LemonadeSDK as backends. Sub-millisecond endpoint selection, sticky sessions with KV-cache affinity, circuit breakers, and an OpenAI-compatible API. See the Olla documentation for configuration details.

If you are evaluating proxy options, our comparison of Olla and LiteLLM covers when each fits and when you might want both.

Gateway: multi-tenant control#

A proxy handles routing. A gateway handles control. The distinction matters when multiple teams or external customers share the same inference infrastructure and you need to enforce who can use what and how much.

A gateway adds:

  • API key management - issue per-team or per-user keys without exposing backend credentials
  • Rate limiting - enforce per-key or per-team request limits
  • Budget enforcement - set and alert on spend or usage ceilings per tenant
  • Audit logging - a full record of who sent what to which model, when

This is the right layer to add when self-hosted inference becomes a shared platform - an internal AI API that other teams or services consume, or a product where customers are using inference resources you provide.

Alloy is our enterprise gateway tier that sits above the proxy layer and adds these controls.

Platform: fleet orchestration#

At some point, the number of inference backends outgrows what you can operate by hand. You have a rack of GPU nodes, or multiple cloud instances, or a mix of on-premise and cloud capacity, and you need to schedule workloads across them, autoscale in response to demand, roll out new model versions without downtime, and monitor fleet-wide health from a single pane.

That is a platform problem, not a proxy or gateway problem. The tooling required is different: job scheduling, capacity planning, rolling deployments, health aggregation across nodes.

FoundryOS is our enterprise inference orchestration platform for exactly this problem - fleet management and monitoring for inference at scale. It is currently in early access. Most teams will not need it for a long time, and that is the right outcome.


Hardware sizing#

Getting hardware right is mostly about asking the right questions before you order anything.

Model size and quantisation#

The primary constraint is VRAM. A model's weights need to fit in GPU memory, plus headroom for the KV cache during inference. As a rough guide: a 7B parameter model at 4-bit quantisation needs roughly 4-5GB of VRAM, a 13B model needs around 8-10GB, and a 70B model at 4-bit needs roughly 35-40GB. These are order-of-magnitude figures - actual requirements vary by architecture, quantisation format, and context length.

Quantisation is the practical lever. The difference in output quality between an 8-bit and a 4-bit quantised version of the same model is often smaller than the difference in hardware requirements. llama.cpp's GGUF formats and vLLM's AWQ/GPTQ support make it straightforward to run capable models on consumer-grade hardware that would struggle with full-precision weights.

Apple Silicon is a genuine option here. The unified memory architecture means a Mac Studio with 192GB of RAM can run 70B models at useful speeds that a system with a modest discrete GPU cannot match. For teams without data centre GPU access, Apple Silicon is often the most cost-effective entry point for mid-size models.

Dense and Mixture-of-Experts models#

Not all large models are architecturally the same, and the distinction matters for hardware planning.

A dense model activates every one of its parameters for every token it processes. A 70B dense model means 70 billion weights in play for each forward pass. That is the number that drives both your VRAM requirement and your per-token compute cost.

A Mixture-of-Experts (MoE) model works differently. The total parameter count is large, but for each token the model routes computation through only a subset of specialised sub-networks called experts. Most of the weight matrix sits idle on any given token. The result: VRAM scales with the total parameter count (all expert weights must be loaded into memory), but per-token compute and latency scale with the much smaller active parameter count. MoE models are VRAM-hungry but punch well above their active-parameter weight on throughput.

The idea itself goes back further than most people realise. Jacobs, Jordan, Nowlan and Hinton introduced the concept in their 1991 paper "Adaptive Mixtures of Local Experts", but it was Shazeer et al.'s 2017 sparsely-gated MoE paper and the 2021 Switch Transformer (Fedus, Zoph, Shazeer) that made it practical at modern scale by showing you could train and route across thousands of experts efficiently. The insight that made it so compelling is straightforward: MoE decouples a model's total capacity (how much it can know, across all experts) from its per-token compute cost (how much work each token actually triggers), so you can grow capability dramatically without a proportional increase in inference cost.

Some approximate figures from current open-weight models to illustrate the gap - figures from official Hugging Face model cards, treat them as order-of-magnitude:

ModelTotal params (approx.)Active params per token (approx.)
Llama 4 Maverick (Meta, 2026)~400B~17B
Kimi K2 (Moonshot AI, 2025)~1T~32B
DeepSeek V4-Pro (DeepSeek, 2026)~1.6T~49B

The practical implication: quantisation still applies to MoE models, but even a 4-bit MoE with hundreds of billions of total parameters demands a lot of memory. That is what pushes you toward high-VRAM cards or multi-GPU servers, regardless of how efficient the active compute path is.

When evaluating an MoE model, check both numbers. A model advertised as "37B active" may have 600B+ total parameters and require GPU memory sized for the latter, not the former.

How we run it at TensorFoundry#

Most of our hardware at TensorFoundry internally runs on NVIDIA GPUs, in particular the NVIDIA RTX PRO 6000 Blackwell for workstations and several in our data center. With 96GB of VRAM, these GPUs can run 70B models at useful speeds on workstations, and on data center servers, we usually use 8xRTX6000 Server Editions to run larger models via Proxmox on AMD EPYC servers.

The 96GB per card matters specifically for the MoE models we run regularly. Holding the full expert weight set in VRAM - rather than paging weights from system memory - is what keeps time-to-first-token responsive. The 8x server configuration extends that to models where total weight exceeds what a single card can hold, distributing the expert layers across cards while keeping the active compute path fast.

Concurrency and batch size#

A single inference server serving one request at a time is a very different hardware requirement to one handling 20 concurrent requests with continuous batching. vLLM's PagedAttention is specifically designed to increase utilisation by batching requests efficiently; with it, a single A100 can handle workloads that would require multiple GPUs under naive request-at-a-time inference.

If your workload is bursty (many short requests from an interactive tool) versus sustained (long document processing jobs), that shapes the hardware and batching configuration differently.

Latency targets#

Time-to-first-token and tokens-per-second are the two numbers that matter for user experience. Time-to-first-token depends heavily on whether the KV cache is warm and on GPU memory bandwidth. Tokens-per-second depends on both memory bandwidth and compute throughput.

For interactive workloads where a person is reading completions as they arrive, time-to-first-token matters most - even a slow tokens-per-second feels acceptable if the first token appears quickly. For batch processing where you are waiting for complete responses, throughput is the metric to optimise for.


Networking, reliability and observability#

Infrastructure that works on a developer's machine and breaks in production usually lacks three things: a single stable entry point, health checking, and visibility into what is actually happening.

Single entry point#

Hardcoding application configuration to a specific backend address means every infrastructure change requires an application change. A proxy gives you one address that does not change, even as the backends behind it are added, removed, or moved. This is the first thing worth getting right.

Health checks and automatic failover#

GPU processes OOM. Containers restart. Machines reboot for kernel updates. Inference processes hang. Any of these will take a backend offline. Without health checking, the next request hits the dead backend and returns an error. With it, the proxy has already marked the backend unhealthy and is routing around it.

The difference for users is between a failed request and a slightly slower one routed to the next available backend.

Per-backend metrics#

Aggregate latency numbers tell you when something is wrong. Per-backend metrics tell you which backend is the problem. Knowing that your vLLM box has a p95 latency of 12 seconds while Ollama is responding normally lets you route around the issue immediately rather than diagnosing it blind.

The instrumentation built into a good proxy gives you this view without requiring separate monitoring setup for each backend.


Privacy, compliance and data sovereignty#

For regulated industries and any organisation handling personal information, the data privacy argument for self-hosting is often the one that closes the decision.

When you call a third-party API, your prompts, any retrieved context, and the completions all travel to and are processed on that provider's infrastructure. Depending on the provider, that may involve multiple regions, subprocessors, and retention policies that are difficult to verify and audit. Data residency requirements - common in healthcare, finance, government, and any business subject to GDPR or the Australian Privacy Act - may simply prohibit this pattern.

Self-hosted inference keeps data on infrastructure you control. The model runs on your hardware. Prompts never leave your network. You define the retention and audit policy, not the provider.

Data residency in practice

For Australian businesses: the Privacy Act 1988 (Cth) and the Australian Privacy Principles require that you take reasonable steps to ensure overseas recipients handle personal information consistently with those principles. In practice, this obligation is difficult to enforce through a contractual relationship with a large cloud provider. Running inference on infrastructure within Australian jurisdiction removes the overseas disclosure question entirely.

For EU businesses: GDPR's Chapter V governs transfers of personal data to third countries. The legal basis for transfers to major US cloud providers has been contested repeatedly. On-premise inference within the EU sidesteps that analysis.


A decision framework: when to self-host, and how to grow the stack#

The right architecture is the one that matches the problem you actually have today, with a clear view of the next rung when you need it.

Start with one backend

Run one backend. Ollama is usually the fastest starting point for a team that does not already have a GPU server configured. Point your application directly at it. You do not need anything else yet.

Add a proxy when you have two or more backends

Add a proxy when:

  • You have two or more backends, or you want failover between a primary and a fallback
  • More than one application is using the same inference infrastructure
  • You need per-backend visibility (latency, error rates) rather than a single opaque endpoint
  • You are using multi-turn conversations and want KV-cache-aware routing

At this point, deploy Olla (or another proxy) in front of your backends. Your applications talk to the proxy; the proxy manages the backends.

Add a gateway when tenancy requirements arrive

Add a gateway when:

  • Multiple teams within your organisation are using the same inference infrastructure and you need per-team API keys and usage attribution
  • You are building a product where external customers consume inference resources you provide
  • You need rate limiting, budget enforcement, or audit logging at the tenant level

Alloy is our gateway layer for this.

Add a platform when the fleet outgrows manual operation

Add a platform when:

  • You have more inference nodes than you can manage with per-machine configuration
  • You need scheduled scaling in response to load patterns
  • Rolling model updates, node health tracking, and fleet-wide observability have become operational requirements

FoundryOS is our platform for this stage (early access).

The right order matters. Do not stand up a fleet orchestration platform on day one. Do not add a gateway until you have tenancy requirements. Each layer adds operational surface area - only add it when the problem demands it.


Decision checklist

  • Self-hosting earns its keep at steady scale, when data residency is required, when latency matters, or when vendor independence is a priority.
  • Start with one backend. Ollama or vLLM depending on whether you need fast setup or high-throughput production serving.
  • Add a proxy (Olla, or similar) as soon as you have two backends or need a stable endpoint for more than one application.
  • Hardware sizing starts with VRAM. Know your model's quantised weight size, your concurrency requirements, and your latency target before ordering. For Mixture-of-Experts models, size VRAM against total parameters, not active parameters.
  • Data sovereignty is a hard requirement in many regulated industries. On-premise inference removes the overseas-disclosure question entirely.
  • Add a gateway (Alloy) only when multi-tenant key management and budget enforcement are actual requirements - not anticipated ones.
  • Add a platform (FoundryOS) only when the fleet outgrows per-machine operational management.
  • Cost: cloud APIs win at low volume; self-hosted wins at steady, predictable scale. Do the arithmetic for your specific workload.

Further reading#