---
model-index:
- name: >-
    Granite-4.0-H-Tiny-MoE — MLX (Apple Silicon), **5-bit** (with guidance for
    2/3/4/6-bit)
  results: []
license: apache-2.0
language:
- en
tags:
- ibm
- granite
- mlx
- apple-silicon
- mamba2
- transformer
- hybrid
- moe
- long-context
- instruct
- quantized
- 5bit
pipeline_tag: text-generation
library_name: mlx
base_model:
- ibm-granite/granite-4.0-h-tiny
---

# Granite-4.0-H-Tiny — **MLX 5-bit** (Apple Silicon)
**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)

This repository provides an **Apple-Silicon MLX build** of **IBM Granite-4.0-H-Tiny** quantized to **5-bit**.  
If you need **higher fidelity** than 3/4-bit but **lower RAM** than 6-bit, 5-bit is a strong middle ground, especially for **document parsing, structured extraction, and long-context** assistants on Mac.

---

## 🔎 About Granite 4.0 (context)
- **Architecture:** Hybrid **Mamba-2 + softmax attention** (the *H* in the name); H-Tiny also uses **Mixture-of-Experts (MoE)** with sparse activation per token.
- **Tier:** **H-Tiny** (~7B total params with ~1B active via MoE), designed for **efficient long-context** inference.
- **License:** **Apache-2.0** (permissive, enterprise-friendly).
- **Typical uses:** Instruction following, long-context assistants, RAG pipelines, structured outputs.

> This card documents the **MLX 5-bit** conversion. See the comparison table below for when to choose 3/4-bit (lower RAM) or 6-bit (highest fidelity).

---

## 📦 What’s in this repo (MLX format)
- `config.json` (MLX), `mlx_model*.safetensors` (**5-bit** shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Model metadata (e.g., `model_index.json`)

Target platform: **macOS** on **Apple Silicon (M-series)** with **Metal** GPU acceleration (MLX's native backend).

---

## ✅ Intended use
- **High-quality** instruction following and summarization with **long context**
- **Document / form / table** parsing and **JSON** extraction (schema-guided prompts)
- **On-device** prototyping where accuracy matters but RAM is modest
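
As a minimal sketch of what a schema-guided extraction prompt could look like, the helpers below build the prompt and then check a model reply. Both function names, the prompt wording, and the example schema are hypothetical and not part of this repository; the reply string stands in for whatever text your generation call returns.

```python
import json


def build_extraction_prompt(document: str, schema: dict) -> str:
    """Build a schema-guided prompt that asks for JSON only (hypothetical helper)."""
    return (
        "Extract the fields described by this JSON schema from the document.\n"
        "Respond with a single JSON object and nothing else.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Document:\n{document}\n"
    )


def parse_json_reply(reply: str, required: list) -> dict:
    """Parse the first JSON object found in a reply and check required keys."""
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    # raw_decode parses one JSON value and ignores any trailing text.
    obj, _ = json.JSONDecoder().raw_decode(reply[start:])
    if not isinstance(obj, dict):
        raise ValueError("reply is not a JSON object")
    missing = [key for key in required if key not in obj]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return obj


# Example with an assumed schema and a hand-written reply.
schema = {
    "type": "object",
    "properties": {"invoice_id": {"type": "string"}},
    "required": ["invoice_id"],
}
prompt = build_extraction_prompt("Invoice INV-7, due 2024-01-31.", schema)
record = parse_json_reply('Here you go: {"invoice_id": "INV-7"}', schema["required"])
```

Validating the reply outside the model keeps malformed or incomplete JSON from silently entering a downstream pipeline, which matters more at lower bit widths.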

## ⚠️ Limitations
- Still quantized: expect some regressions vs. the unquantized model, most visibly on intricate math and code.
- The KV cache grows with context length and can dominate RAM at very long windows; monitor memory use.
- Add your own **guardrails and safety** for production.
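
To make the RAM budgeting concrete, here is a rough weights-only estimate. It is a sketch under stated assumptions: every parameter is quantized, and each group of 64 weights carries a 16-bit scale and a 16-bit bias (about 0.5 extra bits per weight). The ~7B parameter count comes from this card; the function name is hypothetical. Real peak usage is higher, because the cache, activations, and any layers kept at higher precision come on top.

```python
def quantized_weight_gib(
    n_params: float,
    bits: int,
    group_size: int = 64,             # assumed quantization group size
    overhead_bits_per_group: int = 32,  # assumed 16-bit scale + 16-bit bias
) -> float:
    """Estimate weight storage in GiB for group-wise quantized parameters."""
    effective_bits = bits + overhead_bits_per_group / group_size
    return n_params * effective_bits / 8 / 2**30


# Weights-only estimates for a ~7B-parameter model at each bit width.
estimates = {bits: quantized_weight_gib(7e9, bits) for bits in (2, 3, 4, 5, 6)}
for bits, gib in estimates.items():
    print(f"{bits}-bit: ~{gib:.1f} GiB of weights")
```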

---

## 🔢 Choosing a quantization level (MLX on Apple Silicon)
Indicative ranges for a ~7B hybrid MoE LM (actual usage varies by context length and batch size).

| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose |
|---|---:|:---:|---|---|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal RAM devices; smoke tests |
| **3-bit** | ~5–6 GB | **🔥🔥🔥🔥** | Direct, concise | Great default on M1/M2/M3/M4 |
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention vs 3-bit | If 3-bit misses small details |
| **5-bit** *(this repo)* | **~8–9 GB** | **🔥🔥☆** | **Higher fidelity**, fewer omissions | When you want stronger document/JSON faithfulness without 6-bit RAM |
| **6-bit** | ~9.5–11 GB | 🔥🔥 | Highest fidelity of the variants listed | If RAM permits and you need maximum quality |

**Rules of thumb**
- Start at **5-bit** for **document/structured** tasks on 16 GB Macs; its ~8–9 GB peak leaves little or no headroom on 8 GB machines.
- Drop to **3/4-bit** for tighter RAM / higher speed.
- Move to **6-bit** if you still see omissions or slight distortions in outputs.
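
These rules of thumb can be sketched as a small selector. It uses the upper end of each "Typical Peak RAM" range from the table above; the function name and the 2 GB headroom default are assumptions for illustration, so adjust both to your own measurements.

```python
from typing import Optional

# Upper end of the indicative "Typical Peak RAM" ranges from the table, in GB.
PEAK_RAM_GB = {2: 4.0, 3: 6.0, 4: 7.5, 5: 9.0, 6: 11.0}


def pick_bits(free_ram_gb: float, headroom_gb: float = 2.0) -> Optional[int]:
    """Return the highest bit width whose indicative peak fits, else None.

    headroom_gb is an assumed reserve for the OS and other applications.
    """
    usable = free_ram_gb - headroom_gb
    fitting = [bits for bits, peak in PEAK_RAM_GB.items() if peak <= usable]
    return max(fitting) if fitting else None


for free in (5.0, 8.0, 11.0, 16.0):
    print(f"{free:.0f} GB free -> {pick_bits(free)}-bit")
```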

---

## 🚀 Quickstart (CLI — MLX)

**Deterministic generation**
```bash
# Greedy decoding via --temp 0.0. MLX runs on the Apple GPU through Metal by
# default, so no device flag is needed.
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following in 5 bullet points:\n<your text>" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0