AI Infrastructure Blog - GPU & LLM Tools

Deep Dive 24 June 2026

The Bandwidth Wall: A Roofline for Local LLMs

Why token generation on a Mac is bound by memory bandwidth, not compute - the roofline model, prefill versus decode, the M-series bandwidth ladder, and what the M5 actually changed.

Deep Dive 24 June 2026

Under the Hood of MLX: How a Model Runs on a Mac

A deep look at how MLX actually executes an LLM on Apple Silicon - lazy evaluation, graph fusion with mx.compile, unified memory, wired residency and the quantised matmul kernels.

Deep Dive 19 June 2026

MLX vs GGUF: How 4-bit Quantisation Really Works

A deep look at how 4-bit quantisation works on a Mac - MLX group quant, GGUF k-quants, why they differ in quality and speed, and the rotation trick pushing past 4-bit.

Guide 19 June 2026

LLM Quantisation: A Field Guide for 2026

A map of modern LLM quantisation: the number formats, PTQ vs QAT, the method families, sub-4-bit, KV-cache compression and native low-precision hardware.

Guide 14 June 2026

Run MLX Behind Olla on Your Mac

Put oMLX and your other Mac inference backends behind one OpenAI endpoint with Olla - model unification across MLX and GGUF naming, Anthropic passthrough and failover.

Guide 14 June 2026

How MLX Runs LLMs on Apple Silicon

How Apple's MLX framework runs LLMs on Apple Silicon - unified memory, the Neural Engine myth, the M5 speed-up, and how it compares to llama.cpp and Ollama.

Deep Dive 13 June 2026

What We Found When We Benchmarked Olla

We measured Olla's routing, latency, memory use and failover on a Windows dev box, including a head-to-head against LiteLLM. Here are the numbers and how we got them.

Guide 4 June 2026

Self-Hosted LLM vs Cloud API - A Cost Framework

A transparent framework for comparing the cost of self-hosted LLM inference against cloud APIs - the variables that matter, the break-even maths, and where each wins.

Comparison 4 June 2026

Olla vs LiteLLM - Choosing an LLM Proxy

An honest comparison of Olla and LiteLLM - where each fits, where each wins, and how to choose between a Go-based local-first proxy and a Python provider hub.

Guide 4 June 2026

LLM Inference Servers Compared - vLLM, SGLang, llama.cpp and Ollama

A practical comparison of the main LLM inference backends - vLLM, SGLang, llama.cpp and Ollama - what each is built for, the hardware they suit, and how to choose.

Guide 4 June 2026

Deploying LLMs on Your Own Infrastructure - A Practical Guide

A complete guide to running large language models on your own infrastructure - why teams do it, the stack from backends to orchestration, hardware, cost and compliance.

Guide 7 May 2026

What is an LLM Proxy?

A practical look at what an LLM proxy does, why you end up needing one, and how it sits in front of inference backends like Ollama, vLLM and llama.cpp.