We're excited to release Olla 0.3.5, bringing significant performance improvements and new model support. This release focuses on optimising inference speed, especially on ARM-based devices.
New Features
- Llama 3.2 Support: Full support for Meta's latest Llama 3.2 models
- ARM Optimisation: 25% faster inference on Apple Silicon and other ARM devices
- Memory Pooling: Improved memory management reduces allocation overhead
- Enhanced Metrics: More detailed performance metrics and logging
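The memory pooling feature reduces allocation overhead by recycling buffers rather than allocating a fresh one per request. Olla's internal implementation isn't shown in these notes, so the following is only a minimal sketch of the general technique; the `BufferPool` type and its `acquire`/`release` methods are hypothetical names, not Olla's API.

```rust
// Hypothetical buffer pool sketch (not Olla's actual API): hands out
// recycled byte buffers instead of allocating a new one each time.
struct BufferPool {
    free: Vec<Vec<u8>>, // buffers waiting to be reused
    buf_size: usize,    // fixed size of every buffer in this pool
}

impl BufferPool {
    fn new(buf_size: usize) -> Self {
        BufferPool { free: Vec::new(), buf_size }
    }

    // Reuse a returned buffer when one is available; otherwise allocate.
    fn acquire(&mut self) -> Vec<u8> {
        self.free
            .pop()
            .unwrap_or_else(|| vec![0u8; self.buf_size])
    }

    // Hand a buffer back to the pool so later callers can reuse it.
    fn release(&mut self, buf: Vec<u8>) {
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new(4096);
    let a = pool.acquire(); // first call allocates
    pool.release(a);
    let b = pool.acquire(); // second call reuses the returned buffer
    assert_eq!(b.len(), 4096);
    println!("pool reuse ok");
}
```

In steady state the pool reaches a working set of buffers and the allocator drops out of the hot path, which is where the reduced allocation overhead comes from.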
Performance Improvements
- 15% reduced latency for first token
- Improved throughput for batch processing
- Optimised quantisation quality/speed trade-off
- Faster model loading times
Bug Fixes
- Fixed memory leak in long-running sessions
- Resolved issue with model switching
- Corrected token counting for certain models
- Fixed crash on malformed requests
Upgrade Instructions
Docker

```shell
docker pull ghcr.io/thushan/olla:0.3.5
```

Homebrew (macOS)

```shell
brew upgrade olla
```

Cargo

```shell
cargo install olla --version 0.3.5
```

Breaking Changes
None! This release is fully backward compatible with 0.3.x versions.