---
license: other
license_name: meralion-public-license
license_link: https://huggingface.co/MERaLiON/MERaLiON-3-10B/blob/main/LICENSE
base_model: aisingapore/MERaLiON-3-10B
tags:
  - turboquant
  - kv-cache-quantization
  - meralion
  - whisper
  - audio
  - speech-recognition
  - sea-lion
  - quantized
library_name: transformers
pipeline_tag: automatic-speech-recognition
---

# MERaLiON-3-10B-TurboQuant

**TurboQuant KV cache compression** for [aisingapore/MERaLiON-3-10B](https://huggingface.co/aisingapore/MERaLiON-3-10B).

This is a **documentation repository** that explains how to combine MERaLiON-3-10B's weights with TurboQuant inference-time KV cache compression. No weights are stored here — use the base model directly and apply TurboQuant via the Python package or llama.cpp fork.

## Hardware compatibility

| Device | VRAM / RAM | Recommendation |
| --- | --- | --- |
| Any host that runs the base model | Base model footprint, less the KV-cache savings at runtime | TurboQuant is a runtime KV-cache modifier; pair it with any weight variant |

## What is this?

KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime — so the same base weights can be used with or without compression.

| Technique | Where it's applied | Savings |
|-----------|-------------------|---------|
| Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
| **TurboQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |

Both can be combined for maximum efficiency.
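To get a feel for how much attention memory is at stake, the KV cache size can be estimated from the decoder shape. The sketch below is a back-of-the-envelope calculation, not a measurement: the layer count, KV head count, and head dimension are assumed Gemma-2-9B-style values and should be checked against the base model's `config.json`. It also ignores the small per-block metadata (scales, offsets) that a quantized cache stores.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bits_per_value):
    """Bytes needed to hold the keys and values of one sequence."""
    # 2 = one tensor for keys plus one for values, per layer
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8


# ASSUMPTION: Gemma-2-9B-like decoder shape; verify against config.json.
decoder_shape = dict(num_layers=42, num_kv_heads=8, head_dim=256)

for bits in (16, 4, 2):
    size = kv_cache_bytes(seq_len=8192, bits_per_value=bits, **decoder_shape)
    print(f"{bits:>2}-bit KV cache at 8K tokens: {size / 2**30:.2f} GiB")
```

Under these assumptions the 16-bit cache is roughly 2.6 GiB for a single 8K-token sequence, and the 4-bit and 2-bit caches are 4× and 8× smaller before metadata overhead.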

## Quickstart

### Python / transformers

Install the `turboquant` package:

```bash
pip install turboquant
```

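Then use it with the base model. The example below sends a text-only prompt; for audio input, prepare the inputs as described on the base model card: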

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoTokenizer
from turboquant import TurboQuantCache

tokenizer = AutoTokenizer.from_pretrained("aisingapore/MERaLiON-3-10B", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "aisingapore/MERaLiON-3-10B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply TurboQuant to the KV cache
cache = TurboQuantCache(bits=4)  # or bits=2 for more aggressive compression

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=cache,
    use_cache=True,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```


## Model Specifications

| Property | Value |
|----------|-------|
| Base Model | [aisingapore/MERaLiON-3-10B](https://huggingface.co/aisingapore/MERaLiON-3-10B) |
| Architecture | Whisper encoder (weighted-sum) + Gemma-2 decoder |
| Parameters | ~10B (audio encoder + text decoder) |
| Context Length | 8K |
| BF16 Size | ~20 GB |
| Modalities | Audio + Text |
| License | [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-3-10B/blob/main/LICENSE) |

## What is TurboQuant?

[TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) applies random orthogonal rotations followed by optimal scalar quantization to the KV cache. The TurboQuant repository reports bit-identical prefill logits at 4-bit and 4× to 8× KV memory savings on long sequences.
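The rotate-then-quantize idea can be illustrated with a small, dependency-free sketch. This is not the `turboquant` implementation: it builds the random rotation with Gram-Schmidt, uses a plain uniform quantizer as a stand-in for the paper's optimal scalar quantizer, and stores one offset and scale per vector. All function names here are hypothetical.

```python
import math
import random


def random_orthogonal(dim, rng):
    """Random orthogonal matrix via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove components along rows already chosen
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def quantize(vec, bits):
    """Uniform scalar quantization (stand-in for the optimal quantizer)."""
    levels = (1 << bits) - 1
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels
    if scale == 0.0:
        scale = 1.0  # constant vector: any scale reproduces it
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]


rng = random.Random(0)
dim = 32
rotation = random_orthogonal(dim, rng)

key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
key[3] = 25.0  # one outlier channel, common in attention keys

rotated = matvec(rotation, key)                 # spread energy across channels
codes, lo, scale = quantize(rotated, bits=4)    # 4-bit integer codes
restored = matvec(transpose(rotation), dequantize(codes, lo, scale))

err = math.sqrt(sum((a - b) ** 2 for a, b in zip(restored, key)))
rel_err = err / math.sqrt(sum(a * a for a in key))
print(f"relative reconstruction error at 4-bit: {rel_err:.3f}")
```

Because the rotation is orthogonal, its transpose undoes it exactly, so the only error in `restored` comes from the quantization step.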

**Benchmarks** (from the TurboQuant repository, measured with Llama 3.1 8B on an RTX 5090 unless noted):

- 4-bit KV cache: bit-identical prefill logits
- ~1.4-1.7× speedup on Apple Silicon
- Up to 8× KV memory savings

> These figures were not measured on MERaLiON-3-10B, and results vary by model and hardware. Please open a discussion if you have independent results.

## Current Ecosystem Support

| Runtime | TurboQuant Support | Notes |
|---------|----------------------|-------|
| Python transformers + `turboquant` | ✅ Full | Drop-in cache class |
| llama.cpp upstream | ❌ Not merged | Use fork below |
| llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
| Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` (requires flash attention, `OLLAMA_FLASH_ATTENTION=1`) |
| vLLM | ❌ Not supported | — |
| koboldcpp | ❌ Not supported | — |

## Pre-quantized weight variants

If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:

- [MLX (Apple Silicon)](https://huggingface.co/majentik?search=MERaLiON-3-10B+MLX)
- [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=MERaLiON-3-10B+GGUF)

## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [Base model: aisingapore/MERaLiON-3-10B](https://huggingface.co/aisingapore/MERaLiON-3-10B)

## Variants in this family

(Showing 8 sibling variants under `majentik/meralion3-10b-*`. The current variant — `TurboQuant` — is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [RotorQuant](https://huggingface.co/majentik/meralion3-10b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/meralion3-10b-rotorquant-mlx-2bit) | mlx-lm | ~3.2 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/meralion3-10b-rotorquant-mlx-4bit) | mlx-lm | ~6.2 GB | Apple Silicon balanced |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/meralion3-10b-rotorquant-mlx-8bit) | mlx-lm | ~12 GB | Apple Silicon reference |
| **TurboQuant** | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/meralion3-10b-turboquant-mlx-2bit) | mlx-lm | ~3.2 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/meralion3-10b-turboquant-mlx-4bit) | mlx-lm | ~6.2 GB | Apple Silicon balanced |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/meralion3-10b-turboquant-mlx-8bit) | mlx-lm | ~12 GB | Apple Silicon reference |