Olla v0.0.20: Native LlamaCpp Support & Anthropic API Translation

We're excited to announce the release of Olla v0.0.20, a milestone update that brings two game-changing features to local AI development. With native LlamaCpp integration for ultra-lightweight inference and experimental Anthropic API translation, developers can now run production-grade AI workloads anywhere - from high-end servers to Raspberry Pi devices - while maintaining compatibility with tools like Claude Code.

Native LlamaCpp Integration

Lightweight C++ inference engine for running LLMs on any hardware - from servers to edge devices

Performance First

LlamaCpp is a pure C++ implementation optimised for CPU inference with optional GPU acceleration. It eliminates the overhead of Python runtimes and heavy frameworks, making it perfect for resource-constrained environments.

  • CPU-first deployment with no GPU required
  • ARM support for Apple Silicon and Raspberry Pi
  • AVX, AVX2, and AVX512 optimisations for x86
  • Metal acceleration on macOS
  • CUDA and ROCm support when available

Quantisation Options

Choose from extensive GGUF quantisation formats to balance quality and resource usage for your specific hardware:

  • Q2_K: Extreme compression, minimal quality
  • Q3_K_S/M/L: High compression, acceptable quality
  • Q4_K_S/M: Balanced quality and size (a common default)
  • Q5_K_S/M: Good quality, moderate size
  • Q6_K/Q8_0: Near-original quality
  • F16/F32: Full precision (when quality matters most)

Memory Requirements

  • 7B Model (Q4_K_M): 4-6 GB RAM
  • 13B Model (Q4_K_M): 8-10 GB RAM
  • Context (8K tokens): ~1 GB RAM

💡 Memory usage scales linearly with context length. Plan for ~1GB per 8K tokens of context in addition to model size.
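
As a rough worked example using the figures above: a 7B model at Q4_K_M (roughly 5 GB) with a 16K-token context needs about 5 GB + 2 GB ≈ 7 GB of RAM, plus a small allowance for the operating system.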

Configuration Example

Getting started with LlamaCpp in Olla is straightforward. Here's a basic configuration:

# config.yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:8080"
        name: "local-llamacpp"
        type: "llamacpp"
        priority: 95

# For production with load balancing
proxy:
  engine: "olla"
  load_balancer: "round-robin"
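
To make the load-balanced setup concrete, here's a sketch of a configuration with two LlamaCpp instances behind the round-robin balancer. The second endpoint, its address, and the priority values are illustrative assumptions rather than required settings:

# config.yaml - two LlamaCpp instances behind one Olla proxy
discovery:
  static:
    endpoints:
      # primary instance (assumed llama.cpp server on the local machine)
      - url: "http://localhost:8080"
        name: "workstation-llamacpp"
        type: "llamacpp"
        priority: 100
      # secondary edge device, e.g. a Raspberry Pi on the LAN (hypothetical address)
      - url: "http://192.168.1.50:8080"
        name: "pi-llamacpp"
        type: "llamacpp"
        priority: 50

proxy:
  engine: "olla"
  load_balancer: "round-robin"

With round-robin balancing, requests alternate across healthy endpoints; the priority values are only used by a priority-aware balancing strategy, if you choose one.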

Single-Model Architecture

One model per instance ensures predictable performance and resource usage. Perfect for dedicated deployments.

Slot-Based Concurrency

Handle multiple requests efficiently with configurable slot allocation and optional state persistence.

OpenAI API Compatible

Drop-in replacement for OpenAI API endpoints. Works seamlessly with existing tools and libraries.

Edge Deployment Ready

Run on IoT devices, embedded systems, or any ARM/x86 platform. Perfect for air-gapped environments.

Resources & Documentation

Olla is part of TensorFoundry's suite of tools for local AI deployment. Explore our other products including FoundryOS for GPU orchestration and AgentOS for autonomous agent deployment.