mlboydaisuke
/

Gemma-4-31B-CoreAI

Text Generation

Model card Files Files and versions

Gemma-4-31B-CoreAI / README.md

mlboydaisuke's picture

Gemma 4 31B: model card

519b356 verified 13 days ago

|

History Blame Contribute Delete

2.9 kB

	---
	license: gemma
	base_model: google/gemma-4-31B-it-qat-q4_0-unquantized
	tags:
	- core-ai
	- coreai
	- apple
	- gemma
	- gemma-4
	- on-device
	- metal
	pipeline_tag: text-generation
	library_name: coreai
	---

	# Gemma 4 31B (dense) — Core AI

	Apple Core AI (`.aimodel`) conversion of Google's Gemma 4 31B dense text decoder, ported
	directly from the QAT release
	[`google/gemma-4-31B-it-qat-q4_0-unquantized`](https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-unquantized).
	Decode-only, runs on the stock pipelined engine on Apple Silicon (Mac-class, ~16 GB).

	> Frontier dense, unblocked by a custom Metal kernel. Gemma 4 31B's full (global) attention
	> layers have a 32-head × 512 Q tensor that overflows MPSGraph's GPU decode scratch heap — the stock
	> SDPA crashes at the first token ([apple/coreai-models#27](https://github.com/apple/coreai-models/issues/27),
	> the same bug as the 12B). This bundle ships a custom flash-decode SDPA kernel on the full
	> layers (block-GQA over the 31B's 4 global KV heads) that removes the offending op, so the model
	> runs.

	## Bundle (`gpu-pipelined/`)

	\| bundle \| quant \| size \| decode (M4 Max) \|
	\|---\|---\|---\|---\|
	\| `gemma4_31b_qat_decode_int4linsym_msdpa_g8` \| int4 (q4_0-aligned absmax) \| 19 GB \| 17.2 tok/s (prefill 22.1) \|

	int4 from Google's QAT checkpoint (q4_0 grid). A frontier 31B at int4 is bandwidth-bound, so decode
	is in the MLX-parity range — the value is "Core AI runs a frontier dense model the stock engine
	cannot." Mac-only (exceeds the iPhone memory budget). The `_g8` suffix is the higher-occupancy
	flash-decode kernel (8 SIMD-groups per head split the global layers' KV scan; same numerics).

	## Architecture

	Clean dense `gemma4` text decoder — no PLE / AltUp / Laurel / MoE / KV-sharing. 60 layers,
	hidden 5376, 32 heads, vocab 262144, softcap 30, tied embeddings. 5:1 sliding:full; dual head_dim
	(sliding 256 / full `global_head_dim` 512); full layers use `num_global_key_value_heads` 4 with
	`attention_k_eq_v` (value = raw k_proj). Both attention shapes ride one growing KV pair, so the
	bundle loads on the stock `CoreAIPipelinedEngine` (2 states, no engine patch); the full layers' SDPA
	runs as a custom Metal flash-decode kernel.

	## Usage

	```bash
	huggingface-cli download mlboydaisuke/Gemma-4-31B-CoreAI \
	--include "gpu-pipelined/gemma4_31b_qat_decode_int4linsym_msdpa_g8/*" \
	--local-dir ./gemma4-31b-coreai

	COREAI_CHUNK_THRESHOLD=1 llm-runner \
	--model ./gemma4-31b-coreai/gpu-pipelined/gemma4_31b_qat_decode_int4linsym_msdpa_g8 \
	--prompt "What is the capital of France?" --max-tokens 64 --chunk-size 1
	```

	## Conversion

	Community zoo:
	[github.com/john-rocky/coreai-model-zoo → `zoo/gemma4-31b.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/gemma4-31b.md).

	## License

	Gemma — governed by the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).