ADR-0018: On-Device LLM Inference Engine
Date: February 12, 2026
Category: LLM Integration
Context
We need a reliable on-device inference stack to support upcoming LLM-assisted features. The app is offline-only, must run on mid-range Android phones, and should have predictable performance. We are leaning toward models from Liquid AI, but need to choose an inference engine and model format that can run those models locally with acceptable latency and memory use.
Decision
We will adopt an on-device inference engine that supports Liquid AI models and provides:
- Stable Android integration
- Quantized model support
- Streaming token generation
- Predictable memory usage
We will prioritize a lightweight native runtime (C/C++ with JNI) and ship a single model format to minimize complexity.
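A minimal sketch of the intended integration surface, assuming a hypothetical JNI binding (the names LlmEngine, NativeLlmEngine, TokenCallback, and "llm_jni" are illustrative, not an existing Liquid AI or vendor API):

```kotlin
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

// Engine-agnostic contract the app codes against; engine specifics stay behind it.
interface LlmEngine {
    fun loadModel(modelPath: String): Boolean
    fun generate(prompt: String, maxTokens: Int): Flow<String>
    fun unloadModel()
}

// Callback the native layer invokes once per decoded token (streaming generation).
fun interface TokenCallback {
    fun onToken(token: String)
}

// Hypothetical JNI-backed implementation wrapping the C/C++ runtime.
class NativeLlmEngine : LlmEngine {
    override fun loadModel(modelPath: String): Boolean = nativeLoad(modelPath)

    override fun generate(prompt: String, maxTokens: Int): Flow<String> = callbackFlow {
        // Tokens are pushed into the flow as the native decoder produces them;
        // completion and error signalling from the native side are omitted in this sketch.
        val handle = nativeGenerate(prompt, maxTokens) { token -> trySend(token) }
        awaitClose { nativeCancel(handle) }
    }

    override fun unloadModel() = nativeUnload()

    // Illustrative JNI declarations; real symbol names depend on the engine we pick.
    private external fun nativeLoad(modelPath: String): Boolean
    private external fun nativeGenerate(prompt: String, maxTokens: Int, callback: TokenCallback): Long
    private external fun nativeCancel(handle: Long)
    private external fun nativeUnload()

    companion object {
        init {
            // Single native runtime shipped with the app; library name is a placeholder.
            System.loadLibrary("llm_jni")
        }
    }
}
```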
Rationale
- Optimizes for offline reliability and predictable performance.
- Reduces integration risk by standardizing on one engine and format.
- Keeps memory and binary size under control for mid-range devices (a rough budget is sketched below).
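For scale, a deliberately hedged back-of-envelope estimate (the 1B-parameter size and 4-bit quantization are illustrative assumptions, not a chosen model):

```kotlin
// Back-of-envelope weight footprint: params * bitsPerWeight / 8 bytes.
// The example values below are assumptions, not a selected model.
fun weightBytes(params: Long, bitsPerWeight: Int): Long = params * bitsPerWeight / 8

fun main() {
    val bytes = weightBytes(params = 1_000_000_000L, bitsPerWeight = 4)
    // ~0.50 GB of weights, before KV cache, activations, and runtime overhead.
    println("~%.2f GB".format(bytes / 1e9))
}
```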
Consequences
- Engine choice constrains model format and tooling.
- Any future engine switch will require model re-export and retesting.
- We must build a thin abstraction layer to isolate engine specifics (see the usage sketch after this list).
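A minimal usage sketch, assuming the hypothetical LlmEngine/NativeLlmEngine from the Decision sketch above: call sites depend only on the interface, so a future engine swap touches one construction site plus the native implementation and the re-exported model, not feature code.

```kotlin
import kotlinx.coroutines.flow.collect
import kotlinx.coroutines.runBlocking

// Feature code receives the interface; it never references NativeLlmEngine or JNI directly.
suspend fun summarize(engine: LlmEngine, text: String): String {
    val out = StringBuilder()
    engine.generate(prompt = "Summarize: $text", maxTokens = 128)
        .collect { token -> out.append(token) }
    return out.toString()
}

fun main() = runBlocking {
    val engine: LlmEngine = NativeLlmEngine()                  // the only engine-specific line
    check(engine.loadModel("/data/local/models/model.q4.bin")) // path and format are placeholders
    println(summarize(engine, "ADR-0018 decision text"))
    engine.unloadModel()
}
```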
Alternatives Considered
- Multiple engines per device class — increases complexity and QA surface.
- Cloud fallback — violates offline-only constraint.
Notes
- Revisit if Liquid AI releases an official Android runtime with better performance or tooling.