---
model-index:
- name: >-
    Granite-4.0-H-Tiny - MLX (Apple Silicon), **6-bit** (with guidance for
    2/3/4/5-bit)
  results: []
license: apache-2.0
language:
- en
tags:
- ibm
- granite
- mlx
- apple-silicon
- mamba2
- transformer
- hybrid
- moe
- long-context
- instruct
- quantized
- 6bit
- MoE
pipeline_tag: text-generation
library_name: mlx
base_model:
- ibm-granite/granite-4.0-h-tiny
---

# Granite-4.0-H-Tiny - **MLX 6-bit** (Apple Silicon)

**Maintainer / Publisher:** [**Susant Achary**](https://huggingface.co/Susant-Achary)

This repository provides an **Apple-Silicon MLX build** of **IBM Granite-4.0-H-Tiny** quantized to **6-bit**.
Among MLX quant variants, **6-bit** offers the **highest fidelity** while still fitting comfortably on modern M-series Macs. If your workload involves **precise extraction, structured outputs, or long contexts**, 6-bit is usually the best on-device choice.

---

## 🔢 Choosing a quantization level (MLX variants)

Use this table as a **practical** guide for a ~7B hybrid MoE LM on Apple Silicon. (Figures vary by device/context.)

| Variant | Typical Peak RAM | Relative Speed | Typical Behavior | When to Choose |
|---|---:|:---:|---|---|
| **2-bit** | ~3–4 GB | 🔥🔥🔥🔥 | Smallest, most lossy | Minimal RAM devices; smoke tests |
| **3-bit** | ~5–6 GB | **🔥🔥🔥🔥** | Direct, concise | Great default on M1/M2/M3/M4 |
| **4-bit** | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | If 3-bit misses details |
| **5-bit** | ~8–9 GB | 🔥🔥☆ | Higher fidelity | Heavier docs/structured outputs |
| **6-bit** *(this repo)* | **~9.5–11 GB** | 🔥🔥 | **Highest MLX fidelity** | Best quality on-device if RAM permits |

**Tips**
- Prefer **6-bit** when you have ~10–12 GB free and want maximum quality.
- Use **3-bit/4-bit** for tighter RAM with good latency and strong baseline quality.
- For JSON/structured extraction, consider **temperature 0.0** and **schema-style prompts**.
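
The table and tips above can be read as a simple rule: pick the highest-bit variant whose typical peak RAM fits in the memory you have free. The sketch below is one possible way to encode that rule. The helper name `pick_variant` and the `headroom_gb` safety margin are illustrative assumptions, not part of MLX or this repository, and the RAM figures are the upper bounds of the rough ranges in the table.

```python
from typing import Optional

# Upper bounds of the "Typical Peak RAM" ranges from the table above (GB).
# These are rough, device- and context-dependent estimates, not measurements.
VARIANT_PEAK_RAM_GB = {
    "2-bit": 4.0,
    "3-bit": 6.0,
    "4-bit": 7.5,
    "5-bit": 9.0,
    "6-bit": 11.0,
}


def pick_variant(free_ram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the highest-bit variant that fits in free RAM, or None.

    headroom_gb is a hypothetical safety margin (an assumption of this
    sketch) to leave room for the OS and KV-cache growth.
    """
    best = None
    # Dicts preserve insertion order, so this walks from low to high bits.
    for name, peak_gb in VARIANT_PEAK_RAM_GB.items():
        if peak_gb + headroom_gb <= free_ram_gb:
            best = name
    return best


if __name__ == "__main__":
    for free in (4.0, 8.0, 12.0):
        print(f"{free:>4} GB free -> {pick_variant(free)}")
```

With the default 1 GB margin, 12 GB free selects 6-bit, which matches the first tip; lowering `headroom_gb` makes the choice more aggressive at the cost of less slack for long contexts.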
---

## 🔎 About Granite 4.0 (context for this build)

- **Architecture:** Hybrid **Mamba-2 + softmax attention**; *H* tiers add **Mixture-of-Experts (MoE)** for sparse activation and efficiency.
- **Model tier:** **H-Tiny** (~7B total params with ~1B active via MoE), designed for **long-context** use and efficient serving.
- **License:** **Apache-2.0** (permissive, enterprise-friendly).
- **Use cases:** Instruction following, long-context assistants, RAG backends, structured outputs.

> This card documents the **MLX 6-bit** conversion. For lower-RAM devices, see the 2/3/4/5-bit guidance in the quantization table above.

---

## 📦 Contents of this repository

- `config.json` (MLX), `mlx_model*.safetensors` (**6-bit** shards)
- Tokenizer files: `tokenizer.json`, `tokenizer_config.json`
- Any auxiliary metadata (e.g., `model_index.json`)

This build targets **macOS** on **Apple Silicon (M-series)** using **Metal**.

---

## ✅ Intended use

- **High-fidelity** instruction following and summarization
- **Long-context** reasoning and retrieval-augmented generation (RAG)
- **Structured extraction** (JSON, key–value) and document parsing
- On-device prototyping where **answer faithfulness** matters

## ⚠️ Limitations

- As with any quantization, small regressions vs FP16 can occur, most noticeably on complex math and code.
- **Token limits** and **KV-cache growth** still apply for very long contexts.
- Always add your own **guardrails/safety** for sensitive deployments.

## 🚀 Quickstart (CLI, MLX)

**Deterministic generation**

```bash
python -m mlx_lm.generate \
  --model \
  --prompt "Summarize the following meeting notes in 5 bullet points:\n" \
  --max-tokens 256 \
  --temp 0.0 \
  --seed 0