---
license: other
license_name: lfm1.0
license_link: LICENSE
base_model: LiquidAI/LFM2.5-1.2B-Instruct
tags:
  - coreai
  - aimodel
  - apple-silicon
  - on-device
  - lfm2
  - hybrid
pipeline_tag: text-generation
---

# LFM2.5-1.2B-Instruct — Apple Core AI (`.aimodel`)

**LiquidAI's LFM2.5-1.2B-Instruct converted to Apple's Core AI** (the Core ML successor
announced at WWDC26), ready to run on iOS 27 / macOS 27. A conv + full-attention hybrid
(10 short-conv mixers + 6 GQA attention layers) riding Apple's **`coreai-pipelined` GPU
engine** — the first non-Qwen architecture on that fast path, with zero custom kernels.

> Requires the iOS 27 / macOS 27 beta (Core AI ships with the OS). Conversion code, knowledge
> base, and the Swift runner: **[coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)**.

## Measured (greedy; single-step top-1 gated 16/16 vs the fp32 Hugging Face oracle)

| Surface | Bundle | Prefill | Decode |
|---|---|---:|---:|
| **M4 Max**, release `llm-benchmark` | ★★★ `gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym/` (1.6 GB) | 277.8 tok/s | **276.5 tok/s** |
| **iPhone 17 Pro**, one-shot runner | ★★★ same bundle | 44.2–46.6 tok/s | **44.1–46.6 tok/s** |
| M4 Max, release `llm-benchmark` | ★★ `gpu-pipelined/lfm2_5_1_2b_instruct_decode_int8lin/` (1.5 GB) | 253.3 tok/s | **253.3 tok/s** |
| iPhone 17 Pro, one-shot runner | ★★ same bundle | 39.2–39.4 tok/s | **38.0–39.6 tok/s** |
| iPhone 17 Pro, chat app (CoreAIChat LFM mode, 200-tok turn) | int8lin bundle | 30.7 tok/s | **35.8 tok/s** |

- **★★★ = the ship config** (`int8hu_block32_sym`): int8lin + the tied lm_head untied and
  quantized **absmax per-block-32 int8** (`symmetric`, no clipping — clipping corrupts
  big-vocab heads). Versus int8lin, decode is +9% on M4 Max and **+15–20% on iPhone**
  (44.1–46.6 tok/s, roughly 94–98% of the naive bandwidth ceiling: ~60 GB/s ÷ ~1.27 GB/token
  ≈ 47 tok/s); warm engine load 0.3 s. Greedy rollouts are
  token-identical to the int8lin bundle on both verification prompts; oracle gate 16/16 +
  decode step, device numerics 24/24 ≡ Mac-GPU on all 3 runs.
- ★★ int8lin: the fp16-head variant (what CoreAIChat currently downloads); ~87% of its
  ceiling on iPhone. Cold GPU specialization 6.8 s, warm load 1.6 s; no AOT compile needed.
- iPhone greedy sequences are **24/24 token-identical to the M4 Max GPU** on both fixed
  verification prompts (both bundles).
- For scale: our Qwen3.5-0.8B on the same engine does 210 tok/s on M4 Max; this 1.2B does
  276.5 tok/s.
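
The "naive bandwidth ceiling" quoted above is plain division, so it is easy to re-check. The sketch below uses the approximate iPhone figures from the ★★★ bullet (~60 GB/s, ~1.27 GB read per token); the function name is made up for illustration, and with inputs this rounded the result only lands within about a point of the quoted 94–98%.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed if each token streams the weights once."""
    return bandwidth_gb_s / gb_read_per_token


# Approximate figures quoted above for iPhone 17 Pro and the ship bundle.
ceiling = bandwidth_ceiling_tok_s(60.0, 1.27)

# Measured decode range for the ship bundle on iPhone (from the table).
for measured in (44.1, 46.6):
    print(f"{measured} tok/s is {measured / ceiling:.1%} of a {ceiling:.1f} tok/s ceiling")
```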

## What the bundle is

One full **LanguageBundle** (`.aimodel` + `tokenizer/` + `metadata.json`): decode-only graph,
`input_ids` static `[1,1]`, `position_ids` and KV sequence length dynamic (so the engine factory
selects `coreai-pipelined`: async non-blocking encode, on-GPU argmax sampling, on-device KV growth).
Weights are **int8 linear per-block-32** (scale-multiply dequant — no LUT; k-means LUT
gathers measure slower on this GPU delegate) with the embedding, depthwise convs, norms,
and the four attention projections kept high-precision; in the ★★★ bundle the lm_head is
untied and quantized absmax per-block-32 int8 too (in the ★★ bundle it stays fp16/tied).
Do NOT re-quantize the head per-channel: per-channel (axis-0) int8 weights are broken on
the current beta GPU delegate (garbage logits from a delegate lowering bug, documented in
the zoo knowledge base). The attention projections are fp32 on purpose: under a
dynamic-shape graph the delegate's fp16 attention-prologue matmuls lose ~1.3% relative
accuracy, which LFM2.5's large q/k-norm gains amplify into wrong logits; fp32 there
restores layer-level exactness (+126 MB). Full write-up:
[`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).
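
To make "absmax per-block-32 int8, symmetric, no clipping" concrete, here is a minimal pure-Python sketch of that scheme on a single weight row. It is an illustration of the general technique, not the zoo's conversion code: the function names and the toy weights are made up, and the real exporter works on tensors rather than lists.

```python
import math

BLOCK = 32  # block size along the row, as in "per-block-32"


def quantize_row(row, block=BLOCK):
    """Return (int8 values, per-block scales) using symmetric absmax scaling."""
    q, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        absmax = max(abs(w) for w in chunk)
        # No clipping: the largest magnitude in the block maps to +/-127.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        q.extend(round(w / scale) for w in chunk)
    return q, scales


def dequantize_row(q, scales, block=BLOCK):
    """Scale-multiply dequant: one multiply per weight, no lookup table."""
    return [v * scales[i // block] for i, v in enumerate(q)]


# Toy row of 64 weights, i.e. two blocks of 32.
row = [math.sin(0.37 * i) * (1 + i % 5) for i in range(64)]
q, scales = quantize_row(row)
restored = dequantize_row(q, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"blocks={len(scales)} max abs error={max_err:.5f}")
```

Because each block carries its own scale, the rounding error of any weight is bounded by half that block's scale, which is why an outlier in one block does not degrade the others.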

## Run it

```bash
git clone https://github.com/john-rocky/coreai-model-zoo
git clone https://github.com/apple/coreai-models
git -C coreai-models apply ../coreai-model-zoo/apps/coreai-shared-product.patch \
                           ../coreai-model-zoo/apps/coreai-pipelined-extra-states.patch
# (the extra-states patch lets the engine carry the conv state as a fixed-shape extra state)

# download this bundle into coreai-models/exports/, then:
cd coreai-models && swift build -c release
COREAI_CHUNK_THRESHOLD=1 ./.build/release/llm-benchmark \
    --model exports/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym -p 128 -g 256 -n 3
```

Run contract (each of these matters):
- `COREAI_CHUNK_THRESHOLD=1` **before engine creation** — prefill must run as pipelined S=1
  steps (prompt tok/s ≈ decode tok/s).
- **Never call `engine.warmup()`** on this S=1 bundle (it warms query length 256, which the
  static `[1,1]` graph rejects). Warm up with a 1-token generate after load instead; with
  `llm-runner`, pass `--warmup exact --warmup-length 1`.
- Benchmark **Release** builds only (a Debug engine measures ~3× slower).
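
If you script benchmark runs, the first rule is easy to satisfy by putting the variable into the child process environment, so it is in place before the engine exists. A minimal Python sketch, assuming you launch from the `coreai-models` checkout; `benchmark_command` is a hypothetical helper, and the flags are copied verbatim from the command above.

```python
import os
import subprocess


def benchmark_command(model_dir, p=128, g=256, n=3):
    """Build the llm-benchmark invocation shown above, plus its environment."""
    # Set in the child's environment, so it is present before engine creation.
    env = dict(os.environ, COREAI_CHUNK_THRESHOLD="1")
    cmd = [
        "./.build/release/llm-benchmark",  # Release build only
        "--model", model_dir,
        "-p", str(p), "-g", str(g), "-n", str(n),
    ]
    return cmd, env


cmd, env = benchmark_command("exports/lfm2_5_1_2b_instruct_decode_int8hu_block32_sym")
print(" ".join(cmd))
# To actually run it (requires the built binary and the downloaded bundle):
# subprocess.run(cmd, env=env, check=True)
```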

On iPhone, the [CoreAIChat sample app](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps/CoreAIChat)
has an **LFM** picker mode that downloads this repo in-app and chats through this bundle.

## Conversion

Reproducible with
[`conversion/export_lfm2_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_lfm2_decode_pipelined.py)
(+ the `models/macos/lfm2.py` overlay) from the upstream HF checkpoint. Numerics are gated
the strict way: a teacher-forced S=1 sweep over a 16-position oracle prompt (top-1 vs the
fp32 HF reference at every position, 16/16 required) plus an oracle-cache-seeded decode
step — not long-rollout eyeballing. Model card with the full method and the GPU-delegate
findings: [`zoo/lfm2.5.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/lfm2.5.md).
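
The core of the top-1 gate is small enough to sketch. The example below is an assumption-laden illustration, not the zoo's gate script: it takes per-position logits from a reference and a candidate (here toy 4-token vocabularies over 16 positions), compares the argmax at each position, and passes only when every position agrees. `top1_gate` and the toy data are hypothetical.

```python
def argmax(xs):
    """Index of the largest value in a list of logits."""
    return max(range(len(xs)), key=xs.__getitem__)


def top1_gate(reference_logits, candidate_logits):
    """Return (matches, total, passed); passed requires agreement at every position."""
    assert len(reference_logits) == len(candidate_logits)
    total = len(reference_logits)
    matches = sum(
        argmax(r) == argmax(c) for r, c in zip(reference_logits, candidate_logits)
    )
    return matches, total, matches == total


# Toy stand-ins: 16 positions, vocabulary of 4, reference peaks at token p % 4.
reference = [[1.0 if v == p % 4 else 0.0 for v in range(4)] for p in range(16)]
# A candidate with small numeric drift that does not change the top-1 token.
candidate = [[x + 0.01 * v for v, x in enumerate(row)] for row in reference]
# A broken candidate: position 5 now prefers a different token.
broken = [list(reversed(row)) if p == 5 else row for p, row in enumerate(candidate)]

print(top1_gate(reference, candidate))
print(top1_gate(reference, broken))
```

Teacher forcing matters here: every position is fed the reference token, so one early mismatch cannot cascade and hide or inflate later disagreements the way it can in a free-running rollout.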

## License

The model weights derive from
[LiquidAI/LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) and are
redistributed under the **LFM Open License v1.0** (see [LICENSE](LICENSE)): Apache-style
grants, but **Commercial Use is licensed only for entities below US$10M annual revenue**
(qualified non-profits exempt for non-commercial/research use). The conversion code is
BSD-3-Clause (zoo repo).