Olla v0.0.20: Native LlamaCpp Support & Anthropic API Translation

We're excited to announce the release of Olla v0.0.20, a milestone update that brings two game-changing features to local AI development. With native LlamaCpp integration for ultra-lightweight inference and experimental Anthropic API translation, developers can now run production-grade AI workloads anywhere - from high-end servers to Raspberry Pi devices - while maintaining compatibility with tools like Claude Code.

Native LlamaCpp Integration

Lightweight C++ inference engine for running LLMs on any hardware - from servers to edge devices

Performance First

LlamaCpp is a pure C++ implementation optimised for CPU inference with optional GPU acceleration. It eliminates the overhead of Python runtimes and heavy frameworks, making it perfect for resource-constrained environments.

  • CPU-first deployment with no GPU required
  • ARM support for Apple Silicon and Raspberry Pi
  • AVX, AVX2, and AVX512 optimisations for x86
  • Metal acceleration on macOS
  • CUDA and ROCm support when available

Quantisation Options

Choose from extensive GGUF quantisation formats to balance quality and resource usage for your specific hardware:

  • Q2_K: Extreme compression, minimal quality
  • Q3_K_S/M/L: High compression, acceptable quality
  • Q4_K_S/M: Balanced quality and size (a common default)
  • Q5_K_S/M: Good quality, moderate size
  • Q6_K/Q8_0: Near-original quality
  • F16/F32: Full precision (when quality matters most)

Memory Requirements

  • 7B Model (Q4_K_M): 4-6 GB RAM
  • 13B Model (Q4_K_M): 8-10 GB RAM
  • Context (8K tokens): ~1 GB RAM

💡 Memory usage scales linearly with context length. Plan for ~1GB per 8K tokens of context in addition to model size.
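
As a rough worked example using the figures above: a 7B model at Q4_K_M (roughly 5 GB) with a 16K-token context needs about 5 GB + 2 GB ≈ 7 GB of RAM, plus a small allowance for the operating system.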

Configuration Example

Getting started with LlamaCpp in Olla is straightforward. Here's a basic configuration:

# config.yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:8080"
        name: "local-llamacpp"
        type: "llamacpp"
        priority: 95

# For production with load balancing
proxy:
  engine: "olla"
  load_balancer: "round-robin"
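
To make the load-balanced setup concrete, here's a sketch of a configuration with two LlamaCpp instances behind the round-robin balancer. The second endpoint, its address, and the priority values are illustrative assumptions rather than required settings:

# config.yaml - two LlamaCpp instances behind one Olla proxy
discovery:
  static:
    endpoints:
      # primary instance (assumed llama.cpp server on the local machine)
      - url: "http://localhost:8080"
        name: "workstation-llamacpp"
        type: "llamacpp"
        priority: 100
      # secondary edge device, e.g. a Raspberry Pi on the LAN (hypothetical address)
      - url: "http://192.168.1.50:8080"
        name: "pi-llamacpp"
        type: "llamacpp"
        priority: 50

proxy:
  engine: "olla"
  load_balancer: "round-robin"

With round-robin balancing, requests alternate across healthy endpoints; the priority values are only used by a priority-aware balancing strategy, if you choose one.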

Single-Model Architecture

One model per instance ensures predictable performance and resource usage. Perfect for dedicated deployments.

Slot-Based Concurrency

Handle multiple requests efficiently with configurable slot allocation and optional state persistence.

OpenAI API Compatible

Drop-in replacement for OpenAI API endpoints. Works seamlessly with existing tools and libraries.

Edge Deployment Ready

Run on IoT devices, embedded systems, or any ARM/x86 platform. Perfect for air-gapped environments.

Resources & Documentation

Olla is part of TensorFoundry's suite of tools for local AI deployment. Explore our other products including FoundryOS for GPU orchestration and AgentOS for autonomous agent deployment.