---
license: other
license_name: meralion-public-license
license_link: https://huggingface.co/MERaLiON/MERaLiON-3-10B/blob/main/LICENSE
base_model: aisingapore/MERaLiON-3-10B
tags:
- turboquant
- kv-cache-quantization
- meralion
- whisper
- audio
- speech-recognition
- sea-lion
- quantized
library_name: transformers
pipeline_tag: automatic-speech-recognition
---

# MERaLiON-3-10B-TurboQuant

**TurboQuant KV cache compression** for [aisingapore/MERaLiON-3-10B](https://huggingface.co/aisingapore/MERaLiON-3-10B).

This is a **documentation repository** that explains how to combine MERaLiON-3-10B's weights with TurboQuant inference-time KV cache compression. No weights are stored here: use the base model directly and apply TurboQuant via the Python package or llama.cpp fork.

## Hardware compatibility

| Device | VRAM / RAM | Recommendation |
| --- | --- | --- |
| Any host that runs the base model | baseline + runtime savings | RotorQuant/TurboQuant is a KV-cache runtime modifier; pair with any weight variant |

## What is this?

KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression.

| Technique | Where it's applied | Savings |
|-----------|-------------------|---------|
| Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
| **TurboQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |

Both can be combined for maximum efficiency.

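
To make the "reduces attention memory" row concrete, here is a rough back-of-the-envelope estimate of KV cache size. It is only a sketch: the helper name `kv_cache_bytes` is hypothetical, and the layer, head, and dimension numbers are illustrative placeholders, not values read from MERaLiON-3-10B's config. It also ignores quantization metadata such as scales, so real savings would be somewhat lower than the ideal ratio.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Estimate KV cache size in bytes.

    The cache stores one key and one value vector (hence the factor 2)
    per layer, per KV head, per token, per batch element.
    """
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value / 8


# Illustrative decoder shape (ASSUMED, not taken from the model's config).
# Replace these with the values from the config of the model you load.
shape = dict(num_layers=42, num_kv_heads=8, head_dim=256, seq_len=8192)

bf16_bytes = kv_cache_bytes(**shape, bits_per_value=16)
q4_bytes = kv_cache_bytes(**shape, bits_per_value=4)

print(f"BF16 cache:  {bf16_bytes / 2**30:.2f} GiB")
print(f"4-bit cache: {q4_bytes / 2**30:.2f} GiB")
print(f"Ideal ratio: {bf16_bytes / q4_bytes:.1f}x")
```

Because the cache grows linearly with `seq_len` and `batch_size`, the absolute saving matters most for long audio transcripts or batched decoding; for short prompts the weights dominate memory either way.
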
## Quickstart

### Option A: Python / transformers

Install the `turboquant` package:

```bash
pip install turboquant
```

Then use it with the base model:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoTokenizer
from turboquant import TurboQuantCache

# The base model ships custom modeling code, hence trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("aisingapore/MERaLiON-3-10B", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "aisingapore/MERaLiON-3-10B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply TurboQuant to the KV cache
cache = TurboQuantCache(bits=4)  # or bits=2 for more aggressive compression

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=cache,
    use_cache=True,
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## Model Specifications

| Property | Value |
|----------|-------|
| Base Model | [aisingapore/MERaLiON-3-10B](https://huggingface.co/aisingapore/MERaLiON-3-10B) |
| Architecture | Whisper encoder (weighted-sum) + Gemma-2 decoder |
| Parameters | ~10B (audio encoder + text decoder) |
| Context Length | 8K |
| BF16 Size | ~20 GB |
| Modalities | Audio + Text |
| License | [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-3-10B/blob/main/LICENSE) |

## What is TurboQuant?

[TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) applies random orthogonal rotations followed by optimal scalar quantization to the KV cache. The TurboQuant repository reports bit-identical prefill logits at 4-bit and 4-8× KV memory savings for long sequences.

**Benchmarks** (from the TurboQuant repository, Llama 3.1 8B on RTX 5090; results vary by model and hardware):

- 4-bit KV cache: bit-identical prefill logits
- ~1.4-1.7× speedup on Apple Silicon
- Up to 8× KV memory savings

> Benchmarks are from the TurboQuant repository using Llama 3.1 8B. Performance on MERaLiON-3-10B will differ. Please open a discussion if you have independent results.

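
The rotate-then-quantize idea described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the `turboquant` package's implementation: it swaps the paper's optimal scalar quantizer for a plain per-vector uniform quantizer, and every function name here is hypothetical.

```python
import numpy as np


def random_rotation(dim, seed=0):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs so the result does not depend on QR's sign convention.
    return q * np.sign(np.diag(r))


def quantize(x, rotation, bits=4):
    """Rotate each row of x, then quantize to signed integer codes.

    ASSUMPTION: a uniform quantizer with one scale per vector stands in
    for the optimal scalar quantizer used by TurboQuant.
    """
    levels = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit signed codes
    rotated = x @ rotation
    scale = np.abs(rotated).max(axis=-1, keepdims=True) / levels
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero
    codes = np.clip(np.round(rotated / scale), -levels, levels).astype(np.int8)
    return codes, scale


def dequantize(codes, scale, rotation):
    """Undo the scaling, then rotate back with the transpose (the inverse)."""
    return (codes * scale) @ rotation.T


# Toy "key" vectors: 16 tokens with head dimension 64 (illustrative sizes).
keys = np.random.default_rng(1).standard_normal((16, 64))
rot = random_rotation(64)

codes, scale = quantize(keys, rot, bits=4)
restored = dequantize(codes, scale, rot)

rel_err = np.linalg.norm(keys - restored) / np.linalg.norm(keys)
print(f"relative reconstruction error at 4 bits: {rel_err:.3f}")
```

The rotation spreads the energy of any outlier coordinate across all dimensions, which is why a single scalar quantizer can then be applied to every coordinate. Because the rotation is orthogonal, its transpose inverts it exactly, so the only error introduced is the rounding step.
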
## Current Ecosystem Support

| Runtime | TurboQuant Support | Notes |
|---------|----------------------|-------|
| Python transformers + `turboquant` | ✅ Full | Drop-in cache class |
| llama.cpp upstream | ❌ Not merged | Use fork below |
| llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
| Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
| vLLM | ❌ Not supported | n/a |
| koboldcpp | ❌ Not supported | n/a |

## Pre-quantized weight variants

If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:

- [MLX (Apple Silicon)](https://huggingface.co/majentik?search=MERaLiON-3-10B+MLX)
- [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=MERaLiON-3-10B+GGUF)

## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [Base model: aisingapore/MERaLiON-3-10B](https://huggingface.co/aisingapore/MERaLiON-3-10B)

## Variants in this family

(Showing 8 sibling variants under `majentik/meralion3-10b-*`. The current variant, `TurboQuant`, is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/meralion3-10b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/meralion3-10b-rotorquant-mlx-2bit) | mlx-lm | ~3.2 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/meralion3-10b-rotorquant-mlx-4bit) | mlx-lm | ~6.2 GB | Apple Silicon balanced |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/meralion3-10b-rotorquant-mlx-8bit) | mlx-lm | ~12 GB | Apple Silicon reference |
| **TurboQuant** | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/meralion3-10b-turboquant-mlx-2bit) | mlx-lm | ~3.2 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/meralion3-10b-turboquant-mlx-4bit) | mlx-lm | ~6.2 GB | Apple Silicon balanced |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/meralion3-10b-turboquant-mlx-8bit) | mlx-lm | ~12 GB | Apple Silicon reference |