We're excited to announce the release of Olla v0.0.20, a milestone update that brings two game-changing features to local AI development. With native LlamaCpp integration for ultra-lightweight inference and experimental Anthropic API translation, developers can now run production-grade AI workloads anywhere - from high-end servers to Raspberry Pi devices - while maintaining compatibility with tools like Claude Code.
## Native LlamaCpp Integration

*Lightweight C++ inference engine for running LLMs on any hardware - from servers to edge devices*
### Performance First
LlamaCpp is a pure C++ implementation optimised for CPU inference with optional GPU acceleration. It eliminates the overhead of Python runtimes and heavy frameworks, making it perfect for resource-constrained environments.
- CPU-first deployment with no GPU required
- ARM support for Apple Silicon and Raspberry Pi
- AVX, AVX2, and AVX512 optimisations for x86
- Metal acceleration on macOS
- CUDA and ROCm support when available
### Quantisation Options
Choose from the extensive range of GGUF quantisation formats to balance quality and resource usage for your specific hardware.
### Memory Requirements

| Configuration | Approximate RAM |
|---|---|
| 7B model (Q4_K_M) | 4-6 GB |
| 13B model (Q4_K_M) | 8-10 GB |
| Context (8K tokens) | ~1 GB |

💡 Memory usage scales linearly with context length. Plan for ~1 GB per 8K tokens of context in addition to model size.
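That rule of thumb can be turned into a quick sizing helper. This is a rough sketch based only on the figures above (model weights plus ~1 GB per 8K tokens of context), not an exact KV-cache formula; the function name and numbers are illustrative:

```python
def estimate_ram_gb(model_gb: float, context_tokens: int) -> float:
    """Rough sizing rule from the table above: model weight size
    plus roughly 1 GB of RAM per 8K tokens of context."""
    return model_gb + context_tokens / 8192.0

# 7B Q4_K_M (~4-6 GB of weights) with an 8K context:
print(estimate_ram_gb(5.0, 8192))   # → 6.0

# 13B Q4_K_M (~9 GB of weights) with a 16K context:
print(estimate_ram_gb(9.0, 16384))  # → 11.0
```

Treat the result as a lower bound and leave headroom for the OS and any other services on the box.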
### Configuration Example
Getting started with LlamaCpp in Olla is straightforward. Here's a basic configuration:
```yaml
# config.yaml
discovery:
  static:
    endpoints:
      - url: "http://localhost:8080"
        name: "local-llamacpp"
        type: "llamacpp"
        priority: 95

# For production with load balancing
proxy:
  engine: "olla"
  load_balancer: "round-robin"
```

### Single-Model Architecture
One model per instance ensures predictable performance and resource usage. Perfect for dedicated deployments.
### Slot-Based Concurrency
Handle multiple requests efficiently with configurable slot allocation and optional state persistence.
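To illustrate the idea (not llama.cpp's actual internals), slot-based concurrency boils down to a fixed pool of slots that bounds how many requests execute at once; everything else queues. A minimal sketch, with hypothetical names:

```python
import threading

class SlotPool:
    """Toy model of slot-based concurrency: a fixed number of slots
    caps how many requests run simultaneously; excess requests wait."""

    def __init__(self, slots: int):
        self._sem = threading.Semaphore(slots)
        self._lock = threading.Lock()
        self.active = 0   # requests currently holding a slot
        self.peak = 0     # highest concurrency observed

    def handle(self, request_fn):
        with self._sem:                     # block until a slot frees up
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return request_fn()
            finally:
                with self._lock:
                    self.active -= 1

pool = SlotPool(slots=2)
results = []
threads = [
    threading.Thread(target=lambda i=i: results.append(pool.handle(lambda: i * i)))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 8 requests complete, but at most 2 ever run at the same time.
```

The state-persistence side (saving a slot's KV cache between requests) is a server-side feature and isn't modelled here.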
### OpenAI API Compatible
Drop-in replacement for OpenAI API endpoints. Works seamlessly with existing tools and libraries.
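In practice that means you talk to it with a standard OpenAI-style chat completion request. A minimal sketch using only the standard library, assuming the `localhost:8080` endpoint from the config example (the base URL and model name are placeholders for your deployment):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a standard OpenAI-style chat completion request.
    base_url and model are placeholders; point them at your server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8080", "local-model", "Hello!")
# urllib.request.urlopen(req) would send it to a running server.
```

Because the request shape is the standard one, existing OpenAI client libraries also work by simply overriding their base URL.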
### Edge Deployment Ready
Run on IoT devices, embedded systems, or any ARM/x86 platform. Perfect for air-gapped environments.