---
license: other
license_name: lfm1.0
license_link: LICENSE
base_model: LiquidAI/LFM2.5-1.2B-Instruct
tags:
- coreai
- aimodel
- apple-silicon
- on-device
- lfm2
- hybrid
pipeline_tag: text-generation
---

# LFM2.5-1.2B-Instruct: Apple Core AI (`.aimodel`)

**LiquidAI's LFM2.5-1.2B-Instruct converted to Apple's Core AI** (the Core ML successor announced at WWDC26), ready to run on iOS 27 / macOS 27. A conv + full-attention hybrid (10 short-conv mixers + 6 GQA attention layers) riding Apple's **`coreai-pipelined` GPU engine**: the first non-Qwen architecture on that fast path, with zero custom kernels.

> Requires the iOS 27 / macOS 27 beta (Core AI ships with the OS). Conversion code, knowledge
> base, and the Swift runner: **[coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)**.

## Measured (greedy; single-step top-1 gated 16/16 vs the fp32 Hugging Face oracle)

| Surface | Bundle | Prefill | Decode |
|---|---|---:|---:|
| **M4 Max**, release `llm-benchmark` | ★★★ `gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym/` (1.6 GB) | 277.8 tok/s | **276.5 tok/s** |
| **iPhone 17 Pro**, one-shot runner | ★★★ same bundle | 44.2–46.6 | **44.1–46.6 tok/s** |
| M4 Max, release `llm-benchmark` | ★★ `gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8lin/` (1.5 GB) | 253.3 tok/s | **253.3 tok/s** |
| iPhone 17 Pro, one-shot runner | ★★ same bundle | 39.2–39.4 | **38.0–39.6 tok/s** |
| iPhone 17 Pro, chat app (CoreAIChat LFM mode, 200-tok turn) | int8lin bundle | 30.7 | **35.8 tok/s** |

- **★★★ = the ship config** (`int8hu_block32_sym`): int8lin + the tied lm_head untied and quantized **absmax per-block-32 int8** (`symmetric`, no clipping; clipping corrupts big-vocab heads). +9% on M4 Max, **+15–20% on iPhone** (44.1–46.6 ≈ ~94–98% of the naive bandwidth ceiling, ~60 GB/s ÷ ~1.27 GB/token); warm engine load 0.3 s.
  Greedy rollouts are token-identical to the int8lin bundle on both verification prompts; oracle gate 16/16 + decode step, device numerics 24/24 ≡ Mac-GPU on all 3 runs.
- ★★ int8lin: the fp16-head variant (what CoreAIChat currently downloads); ~87% of its ceiling on iPhone. Cold GPU specialization 6.8 s, warm load 1.6 s; no AOT compile needed.
- iPhone greedy sequences are **24/24 token-identical to the M4 Max GPU** on both fixed verification prompts (both bundles).
- For scale: our Qwen3.5-0.8B on the same engine does 210 tok/s on M4 Max; this 1.2B does 276.5.

## What the bundle is

One full **LanguageBundle** (`.aimodel` + `tokenizer/` + `metadata.json`): decode-only graph, `input_ids` static `[1,1]`, position_ids + KV seq dynamic (→ the engine factory selects `coreai-pipelined`: async non-blocking encode, on-GPU argmax sampling, on-device KV growth).

Weights are **int8 linear per-block-32** (scale-multiply dequant, no LUT; k-means LUT gathers measure slower on this GPU delegate) with the embedding, depthwise convs, norms, and the four attention projections kept high-precision; in the ★★★ bundle the lm_head is untied and quantized absmax per-block-32 int8 too (in the ★★ bundle it stays fp16/tied). Do NOT re-quantize the head per-channel: per-channel (axis-0) int8 weights are broken on the current beta GPU delegate (garbage logits; a delegate lowering bug, documented in the zoo knowledge base).

The attention projections are fp32 on purpose: under a dynamic-shape graph the delegate's fp16 attention-prologue matmuls lose ~1.3% relative accuracy, which LFM2.5's large q/k-norm gains amplify into wrong logits; fp32 there restores layer-level exactness (+126 MB).

Full write-up: [`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).

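
To make "absmax per-block-32 int8 (symmetric, no clipping)" with scale-multiply dequant concrete, here is a minimal pure-Python sketch. It is not the zoo's conversion code: the function names are hypothetical, and it assumes blocks are 32 consecutive values along one weight row, which the card does not spell out.

```python
# Illustrative sketch only: absmax (symmetric, unclipped) per-block int8
# quantization. Function names and the block layout (32 consecutive values
# along a weight row) are assumptions, not taken from the conversion script.

BLOCK = 32


def quantize_row(row, block=BLOCK):
    """Return (int8 codes, per-block scales) for one weight row."""
    if len(row) % block != 0:
        raise ValueError("row length must be a multiple of the block size")
    codes, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        absmax = max(abs(v) for v in chunk)
        # Absmax scaling: the largest magnitude in the block maps to +/-127,
        # so no value is clipped. An all-zero block gets a dummy scale of 1.
        scale = absmax / 127.0 if absmax > 0.0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in chunk)
    return codes, scales


def dequantize_row(codes, scales, block=BLOCK):
    """Scale-multiply dequant: weight ~= code * scale of its block (no LUT)."""
    return [c * scales[i // block] for i, c in enumerate(codes)]
```

Calling `quantize_row` on a 64-value row yields 64 codes in `[-127, 127]` and two scales; `dequantize_row` then recovers each weight to within half a quantization step of its block. A per-channel scheme would instead use one scale per output row, which is the layout the card warns against on the current beta GPU delegate.
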
## Run it

```bash
git clone https://github.com/john-rocky/coreai-model-zoo
git clone https://github.com/apple/coreai-models
git -C coreai-models apply ../coreai-model-zoo/apps/coreai-shared-product.patch \
    ../coreai-model-zoo/apps/coreai-pipelined-extra-states.patch
# (the extra-states patch lets the engine carry the conv state as a fixed-shape extra state)
# download this bundle into coreai-models/exports/, then:
cd coreai-models && swift build -c release
COREAI_CHUNK_THRESHOLD=1 ./.build/release/llm-benchmark \
    --model exports/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym -p 128 -g 256 -n 3
```

Run contract (each of these matters):

- Set `COREAI_CHUNK_THRESHOLD=1` **before engine creation**: prefill must run as pipelined S=1 steps (prompt tok/s ≈ decode tok/s).
- **Never call `engine.warmup()`** on this S=1 bundle (it warms query length 256, which the static `[1,1]` graph rejects). A 1-token generate after load is the warmup; `llm-runner` needs `--warmup exact --warmup-length 1`.
- Benchmark **Release** builds only (a Debug engine measures ~3× slower).

On iPhone, the [CoreAIChat sample app](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps/CoreAIChat) has an **LFM** picker mode that downloads this repo in-app and chats through this bundle.

## Conversion

Reproducible with [`conversion/export_lfm2_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_lfm2_decode_pipelined.py) (+ the `models/macos/lfm2.py` overlay) from the upstream HF checkpoint. Numerics are gated the strict way: a teacher-forced S=1 sweep over a 16-position oracle prompt (top-1 vs the fp32 HF reference at every position, 16/16 required) plus an oracle-cache-seeded decode step, not long-rollout eyeballing.

Model card with the full method and the GPU-delegate findings: [`zoo/lfm2.5.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/lfm2.5.md).

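
The top-1 gate described above can be pictured roughly as follows. This is a hedged sketch, not the zoo's gating code: `argmax` and `top1_gate` are hypothetical helpers, and it assumes you already hold one row of logits per prompt position from both the converted model and the fp32 reference.

```python
# Illustrative sketch only: a position-by-position top-1 comparison between a
# candidate run and a reference run. All names are hypothetical; the real gate
# lives in the zoo's conversion tooling and may differ in detail.


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def top1_gate(candidate_logits, reference_logits, required=None):
    """Compare top-1 tokens at every position.

    Returns (passed, matches, total). By default every position must match,
    mirroring a "16/16 required" style gate.
    """
    if len(candidate_logits) != len(reference_logits):
        raise ValueError("both runs must cover the same positions")
    total = len(reference_logits)
    matches = sum(
        argmax(c) == argmax(r)
        for c, r in zip(candidate_logits, reference_logits)
    )
    needed = total if required is None else required
    return matches >= needed, matches, total
```

Because the sweep is teacher-forced, each position is fed the reference token rather than the model's own previous output, so one early mismatch does not cascade and the per-position match count stays meaningful.
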
## License

The model weights derive from [LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) and are redistributed under the **LFM Open License v1.0** (see [LICENSE](LICENSE)): Apache-style grants, but **Commercial Use is licensed only for entities below US$10M annual revenue** (qualified non-profits exempt for non-commercial/research use). The conversion code is BSD-3-Clause (zoo repo).