ADR-0021: On-Device LLM Runtime Architecture
Date: February 12, 2026
Category: LLM Integration
Tags: llm-inference, model-download
Context
- ADR-0018 established that we need an on-device inference engine for Liquid AI models with stable Android integration, quantized model support, streaming tokens, and predictable memory.
- ADR-0019 defined a strict memory budget targeting small, quantized models on mid-range phones.
- ADR-0020 committed to a local, telemetry-free evaluation harness for validating model quality and performance.
- This ADR makes the concrete technical choices: which model, which runtime, which format, which integration pattern, and how the LLM subsystem fits into LocusFlow's existing architecture.
Constraints:
- Offline-only: no cloud fallback, no network calls during inference.
- Mid-range Android target: 4–6 GB total RAM, Snapdragon 6xx/7xx-class SoCs (arm64-v8a).
- The app is Kotlin-first, uses Jetpack Compose, Hilt, Room, and follows MVVM + Repository + UseCase layering (see architecture.md).
- Binary size and cold-start time must remain reasonable for a productivity app; users should not wait more than a few seconds to begin inference.
Decision
1. Model Selection: LFM2.5-1.2B-Instruct
We will use Liquid AI's LFM2.5-1.2B-Instruct as our primary on-device model.
Justification:
- Best-in-class quality at 1B scale. LFM2.5-1.2B-Instruct achieves state-of-the-art scores among 1B-class models on MMLU (58.55 for LFM2, improved further in 2.5), instruction following (IFEval), math, and tool-use benchmarks — outperforming Llama 3.2 1B and Gemma 3 1B, and competing with Qwen3-1.7B despite being roughly 40% smaller.
- Memory-efficient architecture. LFMs use a hybrid architecture (10 double-gated short-range LIV convolution blocks + 6 GQA attention blocks) that provides near-constant memory complexity as input context grows, unlike pure transformer KV-cache scaling. This is critical for mid-range devices.
- Proven on-device performance. On a Samsung Galaxy S25 Ultra (Snapdragon 8 Elite) with Q4_0 quantization via llama.cpp: 335 tok/s prefill, 70 tok/s decode, 719 MB memory. Mid-range devices will be slower but within acceptable bounds.
- 32k effective context length. Confirmed via the RULER benchmark (score >85.6 at 32k), enabling document-length reflection summaries and weekly synthesis inputs.
- Open-weight with permissive license. Apache 2.0-based license allows free commercial use for companies under $10M revenue; fully self-contained, no license server needed.
- Rich ecosystem. Available in GGUF, ONNX, MLX formats; supported by llama.cpp, LEAP Edge SDK, Ollama, vLLM, and more.
Future upgrade path: LFM2.5-1.2B-Thinking for reasoning-heavy features (math in reflections, complex synthesis). Same memory footprint (~900 MB), significantly improved math (+25 points on MATH-500) and tool use capabilities.
2. Primary Runtime: LEAP Edge SDK for Android
We will use Liquid AI's LEAP Edge SDK (ai.liquid.leap:leap-sdk) as the primary inference runtime.
Justification:
- Purpose-built for Android/Kotlin. The SDK is Kotlin-first, integrates with coroutines and Flow, and provides a ViewModel-friendly API that maps directly onto our MVVM architecture.
- Managed model lifecycle. LeapModelDownloader handles GGUF model downloads with WorkManager integration, progress tracking, and caching. ModelRunner and Conversation objects manage model loading/unloading.
- Streaming token generation. Conversation.generateResponse() returns a Flow<MessageResponse> with Chunk, ReasoningChunk, and Complete events — directly consumable by our ViewModels (see the mapping sketch in §6).
- Constrained generation. Built-in support for JSON schema-constrained output (critical for producing structured briefings, reflection summaries, and weekly synthesis).
- Function calling support. Enables future agentic features where the model calls app-defined tools.
- GGUF-native. Loads GGUF quantized models directly; no separate conversion step.
- Actively maintained. Current version v0.9.7; Liquid AI is investing heavily in edge deployment.
Dependency declaration:
# gradle/libs.versions.toml
[versions]
leapSdk = "0.9.7"
[libraries]
leap-sdk = { module = "ai.liquid.leap:leap-sdk", version.ref = "leapSdk" }
leap-model-downloader = { module = "ai.liquid.leap:leap-model-downloader", version.ref = "leapSdk" }
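The app module consumes these through the standard version-catalog accessors that Gradle generates from the keys above:
// app/build.gradle.kts
dependencies {
    implementation(libs.leap.sdk)
    implementation(libs.leap.model.downloader)
}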
3. Fallback Runtime: llama.cpp via JNI
We will maintain llama.cpp as a fallback and benchmarking runtime.
Justification:
- Provides an independent inference path for validating LEAP SDK results in the test harness (ADR-0020).
- Open-source C++ library with no vendor lock-in; well-understood GGUF format.
- Useful for CI benchmarking on non-Android hardware (Linux x86).
- If LEAP SDK introduces breaking changes or licensing shifts, llama.cpp can serve as a drop-in replacement with the same GGUF model files.
We will NOT ship llama.cpp in the production APK unless LEAP SDK proves inadequate. Both runtimes consume the same GGUF model files, so switching is a code-level change, not a model-level one.
4. Model Format and Quantization: GGUF Q4_K_M
We will standardize on GGUF format with Q4_K_M quantization.
| Property | Value |
|---|---|
| Format | GGUF (llama.cpp native) |
| Quantization | Q4_K_M |
| Approximate model size on disk | ~700 MB |
| Approximate runtime memory | ~850–950 MB |
| Context length | 4096 tokens (app-enforced limit; model supports up to 32k) |
Quantization trade-offs considered:
| Quantization | Disk Size | Memory | Quality | Decision |
|---|---|---|---|---|
| Q2_K | ~450 MB | ~550 MB | Significant degradation | Rejected: quality too low for summarization |
| Q3_K_M | ~550 MB | ~700 MB | Noticeable degradation | Viable fallback for low-RAM devices |
| Q4_K_M | ~700 MB | ~900 MB | Near-full precision | Selected: best size/quality balance |
| Q5_K_M | ~850 MB | ~1100 MB | Minimal quality loss | Too large for mid-range 4 GB devices |
| Q8_0 | ~1.2 GB | ~1500 MB | Near-lossless | Rejected: exceeds memory budget |
Q4_K_M is the recommended quantization from Liquid AI's own documentation and offers the best balance of size and quality.
5. App-Enforced Context and Token Limits
| Parameter | Value | Rationale |
|---|---|---|
| Max input context | 2048 tokens | Keeps prefill latency under 6s on mid-range; sufficient for daily reflection + history |
| Max output tokens | 512 tokens | Caps generation time; sufficient for briefings and summaries |
| Temperature | 0.3 | Low randomness for consistent, factual summarization |
| Repetition penalty | 1.05 | Mild penalty to avoid loops |
| Min_p | 0.15 | Filters very low-probability tokens for coherent output |
These limits will be feature-configurable via the abstraction layer (see §6), so different features (briefing vs. reflection summary vs. weekly synthesis) can tune independently.
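A minimal sketch of what LlmConfig and its per-feature presets could look like; the defaults mirror the table above, but the field names and the preset values beyond those defaults are our assumptions, not settled API:
// domain/llm/LlmConfig.kt — sketch; field names are illustrative
data class LlmConfig(
    val maxInputTokens: Int = 2048,       // §5 default
    val maxOutputTokens: Int = 512,
    val temperature: Float = 0.3f,
    val repetitionPenalty: Float = 1.05f,
    val minP: Float = 0.15f,
) {
    companion object {
        // Hypothetical per-feature presets; actual tuning comes from the ADR-0020 harness.
        val Briefing = LlmConfig(maxOutputTokens = 384)
        val ReflectionSummary = LlmConfig()
        val WeeklySynthesis = LlmConfig(temperature = 0.4f)
    }
}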
6. Architecture Integration: Abstraction Layer
We will introduce a thin LlmService abstraction that isolates the inference engine from business logic, consistent with our Repository + UseCase layering.
app/
├── domain/
│ ├── llm/
│ │ ├── LlmService.kt # Interface: generateText, generateStructured
│ │ ├── LlmConfig.kt # Data class: temperature, maxTokens, etc.
│ │ └── LlmResult.kt # Sealed class: Chunk(text), Complete(fullText, stats), Error(reason)
│ └── usecase/
│ ├── GenerateMorningBriefingUseCase.kt
│ ├── SummarizeDailyReflectionUseCase.kt
│ └── GenerateWeeklySynthesisUseCase.kt
├── data/
│ └── llm/
│ ├── LeapLlmService.kt # LEAP SDK implementation of LlmService
│ ├── LlamaCppLlmService.kt # llama.cpp fallback (test/benchmark only)
│ ├── ModelManager.kt # Model download, caching, lifecycle
│ └── PromptTemplateRepository.kt # Manages prompt templates per feature
└── di/
└── LlmModule.kt # Hilt module binding LlmService
Key interfaces:
// domain/llm/LlmService.kt
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.StateFlow

interface LlmService {
    /** True while the model is resident in memory and ready for inference. */
    val isModelLoaded: StateFlow<Boolean>
    /** Load progress in 0.0–1.0, driving the loading indicator. */
    val modelLoadProgress: StateFlow<Float>

    suspend fun loadModel()
    suspend fun unloadModel()

    // Cold flows: generation starts on collection, so these need not be suspend functions.
    fun generateText(prompt: String, config: LlmConfig): Flow<LlmResult>
    fun generateStructured(prompt: String, config: LlmConfig, schema: String): Flow<LlmResult>
}
// domain/llm/LlmResult.kt
sealed class LlmResult {
    /** A streamed span of generated text. */
    data class Chunk(val text: String) : LlmResult()
    /** Terminal event with the assembled text and basic generation stats. */
    data class Complete(val fullText: String, val tokenCount: Int, val latencyMs: Long) : LlmResult()
    data class Error(val reason: String, val exception: Throwable? = null) : LlmResult()
}
The LeapLlmService implementation wraps ModelRunner / Conversation from the LEAP SDK and maps MessageResponse events to our LlmResult sealed class (sketched after this list). This ensures:
- No LEAP SDK types leak into domain or presentation layers.
- Easy swap to llama.cpp or a future Liquid AI official runtime.
- Testability: use cases can be unit-tested with a fake LlmService.
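A condensed sketch of that mapping. The ModelRunner/Conversation types and the Chunk / ReasoningChunk / Complete event names are the SDK features described above; the payload property names (`text`, etc.) and everything else below are our assumptions, not the SDK's verbatim API:
// data/llm/LeapLlmService.kt — mapping sketch under the assumptions stated above
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.catch
import kotlinx.coroutines.flow.flow

class LeapLlmService(private val conversation: Conversation) : LlmService {
    // isModelLoaded, modelLoadProgress, loadModel(), unloadModel(), and
    // generateStructured() omitted; they follow the same wrapping pattern.

    override fun generateText(prompt: String, config: LlmConfig): Flow<LlmResult> = flow {
        val startMs = System.currentTimeMillis()
        val full = StringBuilder()
        var tokens = 0
        conversation.generateResponse(prompt).collect { event ->
            when (event) {
                is MessageResponse.Chunk -> {             // `text` payload name is assumed
                    full.append(event.text)
                    tokens++
                    emit(LlmResult.Chunk(event.text))
                }
                is MessageResponse.ReasoningChunk -> Unit // internal reasoning; never surfaced
                is MessageResponse.Complete -> emit(
                    LlmResult.Complete(full.toString(), tokens, System.currentTimeMillis() - startMs)
                )
                else -> Unit                              // unknown event types ignored
            }
        }
    }.catch { t -> emit(LlmResult.Error(t.message ?: "Generation failed", t)) }
}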
7. Model Delivery Strategy
We will use a download-on-first-use strategy via LEAP's LeapModelDownloader (the resulting feature-gating states are sketched after this list):
- First launch: The app works fully without the LLM. All LLM-assisted features show a "Download AI model" prompt.
- User-initiated download: ~700 MB download over WiFi (with progress UI and WorkManager persistence across app kills).
- Cached locally: Model stored in app-specific cache (Context.cacheDir); survives app restarts.
- Lazy loading: Model loaded into memory only when an LLM feature is invoked; unloaded when the user navigates away or after an idle timeout (configurable, default 5 minutes).
- No bundling in APK: Keeps APK size under 20 MB; avoids Play Store size limits.
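The delivery flow maps naturally onto a small state machine that gates LLM features in the UI. A sketch, using our own domain type (state names are illustrative):
// domain/llm/ModelState.kt — states the delivery flow moves through
sealed class ModelState {
    object NotDownloaded : ModelState()                         // first launch: "Download AI model" prompt
    data class Downloading(val progress: Float) : ModelState()  // WorkManager-backed; survives app kills
    object OnDisk : ModelState()                                // cached in Context.cacheDir, not in memory
    data class Loading(val progress: Float) : ModelState()      // lazy load on first feature use
    object Ready : ModelState()                                 // unloaded again after idle timeout
}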
8. Memory Management and Lifecycle
- Model loading: Performed on Dispatchers.IO in a coroutine; UI shows a loading indicator.
- Model unloading: Triggered by ViewModel onCleared(), app backgrounding (ON_STOP lifecycle event), or idle timeout. Uses modelRunner.unload().
- OOM protection: Before loading, check ActivityManager.getMemoryInfo() for available RAM. If available RAM < 1.2 GB, show a warning or defer loading.
- Thermal throttling: Monitor PowerManager thermal status; if THERMAL_STATUS_SEVERE or above, defer or cancel inference. A combined guard is sketched below.
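The combined pre-load guard, using the platform APIs named above (the helper function itself is ours; the thresholds come from this section):
// data/llm/LoadGuard.kt — pre-load check sketch
import android.app.ActivityManager
import android.content.Context
import android.os.PowerManager

fun canLoadModel(context: Context): Boolean {
    // RAM check: require at least 1.2 GB of available memory before loading.
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val memInfo = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }
    val enoughRam = memInfo.availMem >= 1_200L * 1024 * 1024

    // Thermal check: defer if the device reports SEVERE throttling or worse.
    val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    val coolEnough = pm.currentThermalStatus < PowerManager.THERMAL_STATUS_SEVERE

    return enoughRam && coolEnough
}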
9. Threading Model
| Operation | Thread / Dispatcher | Rationale |
|---|---|---|
| Model download | WorkManager (background) | Survives process death |
| Model loading | Dispatchers.IO | Heavy I/O, 2–5s duration |
| Token generation | LEAP SDK internal thread pool | SDK manages native threads; emits to coroutine Flow |
| Prompt construction | Dispatchers.Default | CPU-bound string operations |
| UI state updates | Dispatchers.Main | Compose recomposition |
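How these rows meet in practice, sketched in a hypothetical ViewModel (class and method names are illustrative; LlmConfig.Briefing refers to the preset sketch in §5):
// presentation/briefing/BriefingViewModel.kt — illustrative sketch
import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import dagger.hilt.android.lifecycle.HiltViewModel
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.update
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
import javax.inject.Inject

@HiltViewModel
class BriefingViewModel @Inject constructor(
    private val llmService: LlmService,
) : ViewModel() {
    private val _briefing = MutableStateFlow("")
    val briefing: StateFlow<String> = _briefing // observed by Compose on Main

    fun generate() {
        viewModelScope.launch {
            // CPU-bound template + local-data assembly off the main thread
            val prompt = withContext(Dispatchers.Default) { buildPrompt() }
            llmService.generateText(prompt, LlmConfig.Briefing).collect { result ->
                when (result) {
                    is LlmResult.Chunk -> _briefing.update { it + result.text }
                    is LlmResult.Complete -> Unit // stats stay local (ADR-0020 harness)
                    is LlmResult.Error -> _briefing.value = "AI briefing unavailable."
                }
            }
        }
    }

    private fun buildPrompt(): String = TODO("hardcoded template + local data")
}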
10. Security and Privacy
- No network during inference: Model runs entirely on-device. The only network call is the one-time model download from Liquid AI's CDN.
- No telemetry: No inference logs, prompts, or outputs are sent anywhere.
- Prompt sanitization: All prompts are constructed from local data (reflections, inbox items) and hardcoded templates. No user-generated prompts are sent to the model as freeform input.
- Model integrity: LEAP SDK verifies GGUF checksums during download.
Rationale
- LEAP SDK over raw llama.cpp JNI: Building and maintaining a JNI bridge to llama.cpp is significant engineering effort (NDK build, ABI management, memory lifecycle across JNI boundary). LEAP SDK provides a production-quality Kotlin API that handles all of this, with streaming, constrained generation, and model management built in.
- LEAP SDK over ONNX Runtime: ONNX Runtime is a strong general-purpose option but lacks the Kotlin-first ergonomics, built-in model downloading, and Liquid AI model optimization that LEAP provides.
- LFM2.5-1.2B over LFM2-700M or LFM2-350M: The 1.2B model is the sweet spot for summarization and briefing quality. The 700M and 350M models are faster but produce noticeably worse multi-sentence outputs. For LocusFlow's use cases (reflection summarization, morning briefings, weekly synthesis), output quality is more important than marginal latency gains.
- LFM2.5-1.2B over LFM2-2.6B: The 2.6B model would exceed memory budget on mid-range 4 GB devices after accounting for the app's own memory usage (~500 MB for Room DB, UI, etc.).
- GGUF over ExecuTorch: LEAP SDK supports both, but GGUF is recommended for all new projects (per Liquid AI docs) due to superior inference performance and better default generation parameters. ExecuTorch bundles are legacy.
- Download-on-first-use over APK bundling: A 700 MB model in the APK would push it well beyond Play Store limits and penalize users who don't use LLM features.
- Single model format: Shipping one GGUF file reduces testing surface, storage requirements, and user confusion. Q4_K_M is the one format to support.
Consequences
- Positive:
- Clean integration path with existing Kotlin/MVVM/Hilt architecture.
- LEAP SDK handles the hardest parts (native inference, model management, streaming).
- Same GGUF model file works across LEAP, llama.cpp, and Ollama — no vendor lock-in on the model level.
- Abstraction layer makes the inference engine swappable without touching use cases or UI.
- Model download is user-controlled; no surprise battery or data usage.
- Downsides:
- LEAP SDK is still pre-1.0 (v0.9.7); API surface may change.
- ~700 MB download required before LLM features work; poor first-run experience for impatient users.
- Mid-range devices may see 3–5s model load time and ~15–30 tok/s decode speed (acceptable but not instant).
- Liquid AI license requires commercial license for >$10M revenue companies (not relevant for current phase but noted).
- Follow-up work:
- Feature-specific prompt templates (separate ADRs per feature: briefing, reflection summary, weekly synthesis).
- Integration of LlmService into the test harness defined in ADR-0020.
- Evaluation of LFM2.5-1.2B-Thinking for reasoning-heavy features.
- Device-tier profiling to determine if Q3_K_M fallback is needed for <4 GB RAM devices.
Alternatives Considered
- Raw llama.cpp via JNI — Viable but requires maintaining NDK builds, JNI bindings, and model lifecycle management that LEAP SDK already solves. Retained as fallback only.
- ONNX Runtime for Android — Cross-platform but lacks Kotlin-first API, streaming Flow support, and built-in model downloading. Would require significant wrapper code.
- MediaPipe LLM Inference API (Google) — Supports on-device LLMs but is optimized for Gemma/Gemini models; limited support for non-Google architectures like LFM.
- ExecuTorch (Meta) — Supported by LEAP but marked as legacy; GGUF path has better performance and is the recommended path forward.
- TensorFlow Lite — Referenced in architecture.md Phase 7 notes, but does not natively support autoregressive LLM generation with streaming. Not suitable.
- Ship LFM2-350M as a "lite" model — Considered for ultra-low-end devices; rejected for now because summarization quality drops significantly. May revisit as a user-selectable option.
- Ollama on Android — Ollama is a desktop tool; no native Android library. Would require running a local server, which adds complexity and battery drain.
Notes
- LEAP SDK requires Android API 31+ (Android 12), arm64-v8a ABI, and 3 GB+ RAM. These requirements align with our existing minSdk target.
- The LEAP SDK may crash on emulators when loading model bundles; physical device testing is required.
- Liquid AI's Apollo app (available on Google Play) can be used to independently vibe-check LFM models on-device before integrating.
- All LFM2.5 models share a 32k token context window. We enforce a lower app-level limit (2048 input + 512 output) to control latency and memory on mid-range devices.
- The PromptTemplateRepository will version prompt templates independently from app releases, enabling A/B testing of prompts via the local test harness without code changes (a possible template shape is sketched after these notes).
- This ADR supersedes the tentative mention of "TensorFlow Lite or ML Kit" in architecture.md's Phase 7 section. That section should be updated to reference this ADR.
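A possible shape for those versioned templates (the fields are our guesses at what the local harness would need, not a settled schema):
// domain/llm/PromptTemplate.kt — hypothetical sketch
data class PromptTemplate(
    val feature: String,   // e.g. "briefing", "reflection_summary", "weekly_synthesis"
    val version: Int,      // bumped independently of app releases
    val template: String,  // placeholders filled only from local data (§10)
)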