---
model-index:
- name: >-
    Granite-4.0-H-Tiny-MoE - MLX (Apple Silicon), **5-bit** (with guidance for
    2/3/4/6-bit)
  results: []
license: apache-2.0
language:
- en
tags:
- ibm
- granite
- mlx
- apple-silicon
- mamba2
- transformer
- hybrid
- moe
- long-context
- instruct
- quantized
- 5bit
- MoE
pipeline_tag: text-generation
library_name: mlx
base_model:
- ibm-granite/granite-4.0-h-tiny
---

# Granite-4.0-H-Tiny - **MLX 5-bit** (Apple Silicon)

**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)

This repository provides an **Apple-Silicon MLX build** of **IBM Granite-4.0-H-Tiny** quantized to **5-bit**.
If you need **more faithfulness** than 3/4-bit but want **lower RAM** than 6-bit, 5-bit is a strong middle ground, especially for **document parsing, structured extraction, and long-context** assistants on Mac.

---

## 🔎 About Granite 4.0 (context)

- **Architecture:** Hybrid **Mamba-2 + softmax attention** (the *H* in the name); H-Tiny additionally uses **Mixture-of-Experts (MoE)**, with sparse activation per token.
- **Tier:** **H-Tiny** (~7B total params with ~1B active via MoE), designed for **efficient long-context** inference.
- **License:** **Apache-2.0** (permissive, enterprise-friendly).
- **Typical uses:** Instruction following, long-context assistants, RAG pipelines, structured outputs.

> This card documents the **MLX 5-bit** conversion. See the comparison table below for when to choose 3/4-bit (lower RAM) or 6-bit (highest fidelity).

---

## 📦 What’s in this repo (MLX format)

- `config.json` (MLX), `mlx_model*.safetensors` (**5-bit** shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Model metadata (e.g., `model_index.json`)

Target platform: **macOS** on **Apple Silicon (M-series)** with **Metal** GPU acceleration.

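To make the RAM trade-off between bit widths more concrete, here is a rough, weights-only estimate in Python. It is a sketch under stated assumptions, not a measurement of this build: it assumes group-wise affine quantization with a group size of 64 and a 16-bit scale and bias per group, and it assumes every one of the ~7B parameters is quantized. The function names are illustrative. Because all experts of an MoE model must be resident in memory, the total parameter count is what matters here, not the ~1B active per token. The result is a lower bound; peak usage is higher once the KV cache, activations, and runtime overhead are included.

```python
# Rough weights-only memory estimate for a group-wise quantized model.
# Assumptions (not taken from the model card): group size 64, one 16-bit
# scale and one 16-bit bias per group, and all parameters quantized.


def effective_bits_per_weight(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Stored bits per weight, including the per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def weight_bytes(n_params, bits, **kwargs):
    """Approximate size in bytes of the quantized weights alone."""
    return n_params * effective_bits_per_weight(bits, **kwargs) / 8


N_PARAMS = 7_000_000_000  # "~7B total params" as stated in the card

for bits in (2, 3, 4, 5, 6):
    gb = weight_bytes(N_PARAMS, bits) / 1e9
    print(f"{bits}-bit: ~{gb:.1f} GB of weights (lower bound on RAM)")
```

Under these assumptions each extra bit of precision adds a fixed amount of weight storage, so the remaining gap to observed peak RAM comes from context length and runtime buffers.
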

---

## ✅ Intended use

- **High-quality** instruction following and summarization with **long context**
- **Document / form / table** parsing and **JSON** extraction (schema-guided prompts)
- **On-device** prototyping where accuracy matters but RAM is modest

## ⚠️ Limitations

- Still quantized: some regressions vs FP16 can surface on intricate math/code.
- The KV cache can dominate RAM at very long context lengths, so monitor your memory budget.
- Add your own **guardrails and safety** for production.

---

## 🔢 Choosing a quantization level (MLX on Apple Silicon)

Indicative ranges for a ~7B hybrid MoE LM (actual usage varies by context length and batch size).

| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose |
|---|---:|:---:|---|---|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal RAM devices; smoke tests |
| **3-bit** | ~5–6 GB | **🔥🔥🔥🔥** | Direct, concise | Great default on M1/M2/M3/M4 |
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention vs 3-bit | If 3-bit misses small details |
| **5-bit** *(this repo)* | **~8–9 GB** | **🔥🔥☆** | **Higher fidelity**, fewer omissions | When you want stronger document/JSON faithfulness without 6-bit RAM |
| **6-bit** | ~9.5–11 GB | 🔥🔥 | Highest MLX fidelity | If RAM permits and you need maximum quality |

**Rules of thumb**

- Start at **5-bit** for **document/structured** tasks on Macs with 16 GB of unified memory; the ~8–9 GB peak leaves little or no headroom on 8 GB machines.
- Drop to **3/4-bit** for tighter RAM / higher speed.
- Move to **6-bit** if you still see omissions or slight distortions in outputs.

---

## 🚀 Quickstart (CLI, MLX)

**Deterministic generation**

```bash
# Set MODEL to this repo's Hub ID or a local path to the MLX weights (placeholder).
python -m mlx_lm.generate \
  --model "$MODEL" \
  --prompt "Summarize the following in 5 bullet points:\n" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0