Self-Hosted LLM vs Cloud API - A Cost Framework
The 90% saving claim#
The core tension is simple: cloud APIs charge per token, which is excellent at low and unpredictable volumes but expensive at scale. Self-hosted hardware has high upfront costs and ongoing overheads, but the marginal cost per token approaches zero once the machine is running. At some point those two cost curves cross. Where they cross is what you need to find.
Here is what we will work through:
- How cloud API costs actually compound: per-token billing is cheap at low volumes and expensive at scale. Understanding the formula and the variables that move it is the first step.
- How to model self-hosting costs honestly: hardware amortisation, power draw, and operational time - all three cost buckets need to be on the table before any comparison is meaningful.
- The break-even calculation: a transparent worked example with all assumptions stated explicitly - use the formula with your own numbers, not ours.
- Non-cost factors that often decide first: data privacy, latency, vendor dependency, and absence of per-call anxiety frequently settle the question before the spreadsheet runs.
- How to get more value from hardware you already own: utilisation matters as much as hardware cost. Better routing and scheduling often improve your economics before more cards do.
The cloud side: per-token billing#
Cloud inference APIs bill on a simple per-million-tokens model. The formula is:
monthly_cloud_cost =
(monthly_input_tokens × input_price_per_million / 1_000_000) +
(monthly_output_tokens × output_price_per_million / 1_000_000)
The prices vary enormously by model tier. At the time of writing:
- Small, efficient models (7B-class, quantised) are available via cloud providers at roughly $0.05 to $0.20 per million tokens on the cheaper end.
- Mid-tier capable models sit in the $0.50 to $2.00 per million tokens range.
- Frontier models (the largest, most capable) run $5 to $15 or more per million tokens.
These figures are public ballparks only. Pricing changes frequently and varies by provider, region, tier, and volume commitment. Always check the current pricing page of whichever provider you are evaluating.
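The per-token formula above can be sketched as a small Python function. This is a minimal illustration, not a provider's billing logic: the function name mirrors the formula, and the workload and prices in the example call are hypothetical.

```python
def monthly_cloud_cost(
    monthly_input_tokens: int,
    monthly_output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Monthly API spend in dollars under simple per-million-token pricing."""
    input_cost = monthly_input_tokens * input_price_per_million / 1_000_000
    output_cost = monthly_output_tokens * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Hypothetical workload and prices, for illustration only.
example = monthly_cloud_cost(
    monthly_input_tokens=20_000_000,
    monthly_output_tokens=5_000_000,
    input_price_per_million=0.50,
    output_price_per_million=1.50,
)
print(f"${example:,.2f} per month")
```

Keeping input and output as separate arguments matters because output tokens are usually priced higher, so the same total volume can cost quite different amounts depending on the mix.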
The appeal of cloud APIs is real. There is no capital expenditure, no hardware to maintain, and you can burst to very high token volumes on short notice without provisioning anything. For a team building a prototype, or a workload that spikes hard and irregularly, paying per token is the right call.
The problem appears when your usage becomes large and predictable. At that point you are paying a recurring per-token tax on a workload that a fixed cost could serve.
The self-hosted side: building an honest cost model#
Self-hosting has three cost buckets: capital expenditure on hardware, ongoing power, and ongoing operations. All three need to be on the table.
Hardware capex, amortised#
GPU hardware spans a wide range. For rough orientation only (prices change, used markets vary, cloud spot instances complicate comparisons):
- A consumer-grade card capable of running 7B-14B models in 4-bit quantisation sits at the low end of the range, in the hundreds to low thousands of dollars.
- A prosumer workstation card capable of larger models or higher throughput is in the mid-thousands.
- Datacentre-class hardware (H100/H200-class, the kind you need for large unquantised models or serious throughput) is in the tens of thousands per card, often much more for multi-card configurations.
Do not use these as precise quotes. Check current prices from resellers, consider the used/refurbished market, and factor in that availability and pricing for datacentre hardware can shift sharply.
The key variable is amortisation period - how many months you spread the upfront cost across. A server you depreciate over 36 months has half the monthly hardware cost of one you depreciate over 18. A reasonable range for a GPU used for inference is 24-48 months, depending on how aggressively the hardware will be used and how quickly you expect the model landscape to make it obsolete.
monthly_hardware_cost = purchase_price / amortisation_months
Power#
GPU power draw is real money, especially at Australian electricity rates.
monthly_power_cost =
(gpu_tdp_watts / 1000) × daily_hours × 30 × electricity_rate_per_kwh
As a rough reference for Australian commercial/small-business electricity: rates in the range of $0.25 to $0.45 per kWh are plausible at the time of writing, but vary significantly by state, tariff, and whether you are on a time-of-use plan. Check your actual bill.
A consumer GPU drawing 250-350W running 16 hours a day will add roughly $30-75 per month in electricity at those rates (0.25kW × 16h × 30d × $0.25 at the low end, 0.35kW × 16h × 30d × $0.45 at the high end). A datacentre card at 400-700W running continuously will be substantially more, and that figure multiplies per card in a multi-GPU system.
Do not forget: the GPU is not the only draw. The host system (CPU, RAM, storage, networking) adds another 100-300W depending on configuration.
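A minimal sketch of the power formula, extended to include the host system and multiple cards. The `host_watts` and `gpu_count` parameters are additions for illustration and are not part of the formula above; the sketch also assumes the hardware draws its rated power for every running hour, which overstates cost for a box that idles.

```python
def monthly_power_cost(
    gpu_watts: float,
    daily_hours: float,
    electricity_rate_per_kwh: float,
    host_watts: float = 0.0,  # assumption: CPU, RAM, storage, networking draw
    gpu_count: int = 1,       # assumption: identical cards in one system
    days_per_month: int = 30,
) -> float:
    """Monthly electricity cost in dollars for a GPU box at constant draw."""
    total_kw = (gpu_watts * gpu_count + host_watts) / 1000
    kwh_per_month = total_kw * daily_hours * days_per_month
    return kwh_per_month * electricity_rate_per_kwh


# Illustrative inputs only: a 300W card, 16 hours a day, $0.30/kWh.
gpu_only = monthly_power_cost(300, 16, 0.30)
with_host = monthly_power_cost(300, 16, 0.30, host_watts=200)
print(f"GPU only: ${gpu_only:.2f}, with 200W host: ${with_host:.2f}")
```

In this illustrative case the host overhead adds about two thirds again on top of the GPU-only figure, which is why leaving it out makes self-hosting look cheaper than it is.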
Operations and maintenance#
This is the cost that disappears from most comparisons and comes back to bite hardest.
Someone has to set up the inference server, keep it updated, handle driver issues, manage disk space, respond when the service goes down at 11pm, and integrate new model releases. If that is you, it is time you are not spending on other things.
monthly_ops_cost = ops_hours_per_month × your_hourly_rate
For a single GPU box running a stable setup, ops_hours_per_month might be as low as 2-4 hours. For a fleet or a team with SLA expectations, it can be substantially more.
There is also one-off setup cost - initial hardware procurement, racking or installation, OS setup, inference server configuration. This is usually a one-time investment but worth including in the first-year calculation.
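One way to fold the one-off setup effort into a first-year figure is sketched below. The `setup_hours` parameter is a hypothetical way of pricing setup as time at the same hourly rate; if your setup cost is a contractor invoice or a colocation fee, substitute a dollar amount instead.

```python
def monthly_ops_cost(ops_hours_per_month: float, hourly_rate: float) -> float:
    """Recurring operations cost in dollars per month."""
    return ops_hours_per_month * hourly_rate


def first_year_selfhosted_cost(
    monthly_hardware_cost: float,
    monthly_power_cost: float,
    ops_hours_per_month: float,
    hourly_rate: float,
    setup_hours: float,  # assumption: one-off setup priced as hours of your time
) -> float:
    """Twelve months of recurring cost plus the one-off setup effort."""
    recurring = 12 * (
        monthly_hardware_cost
        + monthly_power_cost
        + monthly_ops_cost(ops_hours_per_month, hourly_rate)
    )
    one_off = setup_hours * hourly_rate
    return recurring + one_off


# Illustrative numbers only.
total = first_year_selfhosted_cost(
    monthly_hardware_cost=200,
    monthly_power_cost=50,
    ops_hours_per_month=4,
    hourly_rate=100,
    setup_hours=10,
)
print(f"First-year cost: ${total:,.0f}")
```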
Total self-hosted cost#
monthly_selfhosted_cost =
monthly_hardware_cost +
monthly_power_cost +
monthly_ops_cost
The break-even calculation#
Self-hosting is cheaper when:
monthly_cloud_cost > monthly_selfhosted_cost
Rearranging: you need to find the monthly token volume at which cloud spend crosses the self-hosted floor. Below that volume, cloud wins. Above it, self-hosting wins (on cost; more on the caveats shortly).
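The rearrangement can be sketched in a few lines of Python. Two simplifying assumptions are baked in and are not from the formulas above: output tokens are a fixed share of total volume (`output_share`), and the self-hosted cost stays flat as volume grows, which only holds while the hardware has spare capacity.

```python
def blended_price_per_million(
    input_price_per_million: float,
    output_price_per_million: float,
    output_share: float,
) -> float:
    """Average cloud price per million tokens for a given output share (0 to 1)."""
    return (
        input_price_per_million * (1 - output_share)
        + output_price_per_million * output_share
    )


def break_even_monthly_tokens(
    monthly_selfhosted_cost: float,
    input_price_per_million: float,
    output_price_per_million: float,
    output_share: float,
) -> float:
    """Total monthly tokens at which cloud spend equals the self-hosted cost."""
    blended = blended_price_per_million(
        input_price_per_million, output_price_per_million, output_share
    )
    return monthly_selfhosted_cost / blended * 1_000_000


# Hypothetical inputs: $600/month self-hosted, $1/M input, $3/M output,
# with a quarter of all tokens being output.
volume = break_even_monthly_tokens(600, 1.00, 3.00, 0.25)
print(f"Break-even at about {volume / 1_000_000:.0f}M tokens per month")
```

If your actual volume sits well below the result, cloud is cheaper; well above it, self-hosting is cheaper, provided the hardware can actually serve that volume.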
Worked example - illustrative only#
The following numbers are completely made up for illustration. Do not use them as benchmarks. Plug in your actual costs.
Assumptions stated explicitly:
- Workload: 500 million input tokens + 100 million output tokens per month (a modest but real internal tooling load)
- Cloud model: mid-tier, $1.00/M input, $3.00/M output (illustrative rate only)
- Hardware: $8,000 GPU workstation, amortised over 36 months
- Power: 300W draw, 20 hours/day, $0.30/kWh (illustrative AUS rate)
- Ops: 4 hours/month at $100/hour (your own time, valued conservatively)
| Cost component | Monthly (illustrative) |
|---|---|
| Cloud - input tokens (500M × $1.00/M) | $500 |
| Cloud - output tokens (100M × $3.00/M) | $300 |
| Cloud total | $800 |
| Hardware amortisation ($8,000 / 36mo) | $222 |
| Power (0.3kW × 20h × 30d × $0.30) | $54 |
| Ops (4h × $100) | $400 |
| Self-hosted total | $676 |
In this illustrative scenario, self-hosting comes out cheaper by about $124/month, or roughly $1,500/year. Not dramatic. Change the ops assumption to 2 hours, and the gap widens to about $324/month. Change the model to a frontier tier at $10/M output, and the gap widens substantially. Change the workload to 50 million tokens/month instead of 600 million, and cloud wins comfortably.
The model is sensitive to ops cost and token volume above everything else. Those are the two variables to scrutinise most carefully in your own spreadsheet.
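To see that sensitivity directly, the worked example can be re-run while varying one assumption at a time. The sketch below reuses the illustrative numbers from the table as defaults; the function names and the choice of values to sweep are hypothetical, and it assumes the same hardware can serve every volume tested.

```python
def cloud_cost(
    input_millions: float,
    output_millions: float,
    input_price: float = 1.00,
    output_price: float = 3.00,
) -> float:
    """Monthly cloud spend; volumes in millions of tokens, prices per million."""
    return input_millions * input_price + output_millions * output_price


def selfhosted_cost(
    ops_hours: float,
    hardware_price: float = 8_000,
    amortisation_months: int = 36,
    draw_kw: float = 0.3,
    daily_hours: float = 20,
    rate_per_kwh: float = 0.30,
    hourly_rate: float = 100,
) -> float:
    """Monthly self-hosted cost: hardware amortisation + power + ops."""
    hardware = hardware_price / amortisation_months
    power = draw_kw * daily_hours * 30 * rate_per_kwh
    ops = ops_hours * hourly_rate
    return hardware + power + ops


# Vary ops time at the example's 500M input + 100M output tokens.
for ops_hours in (2, 4, 8):
    saving = cloud_cost(500, 100) - selfhosted_cost(ops_hours)
    print(f"ops {ops_hours}h/month: self-hosting saves ${saving:,.0f}")

# Vary total volume at 4 ops hours, keeping the 5:1 input-to-output ratio.
for total_millions in (50, 300, 600, 1200):
    cloud = cloud_cost(total_millions * 5 / 6, total_millions / 6)
    saving = cloud - selfhosted_cost(4)
    print(f"{total_millions}M tokens/month: self-hosting saves ${saving:,.0f}")
```

A negative saving means cloud is cheaper. With these illustrative defaults, doubling ops time to 8 hours a month is enough to flip the result in favour of cloud.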
Non-cost factors that often decide first#
The spreadsheet matters less than you might expect, because several non-cost factors tend to force the decision before the break-even analysis runs.
Data privacy and residency#
If your workload involves personal information under the Australian Privacy Act 1988, health records, or data subject to contractual restrictions, sending it to a third-party API may not be permissible regardless of cost. The same applies to data covered by the GDPR for EU operations, or sector-specific frameworks in finance and government.
Self-hosting keeps data entirely within your own infrastructure. There is no third-party data processing agreement to negotiate, no vendor sub-processor list to audit, and no exposure if a provider changes their data handling terms.
For many organisations, this is not a cost decision. It is a compliance decision, and it settles the question immediately.
Latency and control#
Cloud APIs introduce network round-trip time. For interactive use cases - a user waiting for a response - that matters. A local inference server responds from within your network, often with lower and more consistent latency (depending on hardware and model size).
You also control the model version. A cloud provider may deprecate a model or change its behaviour in a minor update. On your own hardware, the weights are yours. Nothing changes unless you choose to update.
Absence of per-call anxiety#
At high token volumes, per-call billing creates a quiet operational stress: every inefficient prompt, every retry, every unexpectedly long output has a line-item cost. Self-hosted inference removes that mental overhead. You can experiment freely, run batch jobs without watching a meter, and iterate on prompts without a billing alarm in the back of your mind.
Vendor dependency#
A cloud API is an external dependency. Pricing can change. A model can be deprecated. An outage affects you with no recourse. Self-hosted removes that dependency, at the cost of taking on the operational responsibility yourself.
Getting more from hardware you already own#
One of the most common sources of self-hosting waste is not paying too much for hardware; it is hardware that sits at low utilisation. A GPU box that handles peak load but idles most of the day is an expensive asset earning poor returns.
This is where routing and orchestration matter. If you have multiple inference backends - a workstation, a lab GPU server, a developer's machine - requests that would otherwise queue on one can be distributed across all of them. A proxy like Olla handles this at the single-box-to-small-fleet level: one stable endpoint, load balancing across registered backends, automatic failover when one goes offline. It does not change your cost model, but it changes how much throughput you extract from the hardware cost you are already paying. There is more detail in our LLM proxy explainer.
At fleet scale - many GPUs across racks, teams with different scheduling needs, mixed model deployments - keeping utilisation high becomes a genuine orchestration problem. That is what FoundryOS addresses (currently in early access): fleet management and monitoring for inference infrastructure at the scale where idle GPU time is expensive enough to warrant a dedicated platform.
If you are at the single-box stage, neither of those is urgent. But if you are trying to justify self-hosting economics and your hardware is sitting at 30% utilisation, better routing is often the cheapest improvement available.
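The effect of utilisation on per-token economics can be sketched as follows. The peak throughput figure is a made-up placeholder (measure your own), and the sketch treats the whole monthly cost as fixed, ignoring the fact that an idle GPU draws less power than a busy one.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600


def effective_cost_per_million(
    monthly_fixed_cost: float,
    peak_tokens_per_second: float,  # assumption: sustained throughput at full load
    utilisation: float,             # fraction of the month spent serving, 0 to 1
) -> float:
    """Dollars per million tokens actually served at a given utilisation."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in the range (0, 1]")
    tokens_served = peak_tokens_per_second * SECONDS_PER_MONTH * utilisation
    return monthly_fixed_cost / tokens_served * 1_000_000


# Hypothetical box: $676/month all-in, 500 tokens/second at full load.
for utilisation in (0.3, 0.6, 0.9):
    cost = effective_cost_per_million(676, 500, utilisation)
    print(f"{utilisation:.0%} utilised: ${cost:.2f} per million tokens")
```

Because the cost is fixed, doubling utilisation halves the effective price per token, which is the comparison to make against a cloud provider's per-million rate.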
For guidance on the hardware and engine choices themselves, see our inference servers compared article. For the broader picture, see the deploying LLMs on your own infrastructure guide.
Key takeaways
- The five variables that matter most: monthly token volume, output-to-input ratio (output is typically billed 2-4x higher), hardware amortisation period, electricity cost (check your actual rate, not a national average), and the honest cost of your own operational time.
- Cloud wins when usage is low-volume, bursty, or unpredictable. The zero-capex, zero-ops model is genuinely better for prototypes, irregular workloads, and teams that cannot afford infrastructure distraction.
- Self-hosting wins when usage is high-volume, steady, and predictable - and when data privacy or residency requirements make third-party processing impractical regardless of cost.
- The break-even is sensitive to ops cost. Two hours of maintenance per month and ten hours per month produce very different numbers. Be honest about this before committing to hardware.
- Utilisation matters as much as hardware cost. A GPU at 80% utilisation has roughly half the effective per-token cost of the same GPU at 40% utilisation. Invest in routing and scheduling before investing in more cards.
- Do not trust a single comparison. Run the model with your token volumes, your electricity rate, and your local hardware prices. The framework above gives you the variables; the numbers have to come from your own situation.