---
base_model: JetBrains/Mellum2-12B-A2.5B-Instruct
base_model_relation: quantized
library_name: gguf
pipeline_tag: text-generation
language:
- en
tags:
- mellum
- gguf
- llama.cpp
- quantized
- moe
- instruct
license: apache-2.0
---

# Mellum2 Instruct — GGUF (Q4_K_M)

This repository contains a **GGUF Q4_K_M** quantization of
[`JetBrains/Mellum2-12B-A2.5B-Instruct`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct), ready to run with
[`llama.cpp`](https://github.com/ggml-org/llama.cpp), Ollama, LM Studio, and
other GGUF-compatible runtimes.

**This quantization (Q4_K_M):** 4-bit k-quant (medium). It offers a strong quality/size trade-off (KLD of about 0.106 and 87.2% top-token agreement against BF16), which makes it a good default.

| File | Size |
|---|---|
| `Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf` | 8.1 GB |

Mellum 2 Instruct is a Mixture-of-Experts assistant model (64 experts, 8
activated per token, 131,072-token context) that answers directly, without an
externalized chain of thought. For the full model description, evaluation
results, and architecture details, see the original model card:
**[JetBrains/Mellum2-12B-A2.5B-Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct)**.
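
To make the "64 experts, 8 activated per token" figure more concrete, here is a minimal sketch of generic top-k expert routing. It is not taken from the Mellum 2 implementation: the function name `route_token`, the random router scores, and the choice to apply softmax over only the selected experts are all illustrative assumptions. Only the constants 64 and 8 come from the description above.

```python
import math
import random

NUM_EXPERTS = 64  # from the model description above
TOP_K = 8         # experts activated per token, from the description above


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and normalize their weights.

    Generic top-k gating (an assumption, not Mellum 2's actual router):
    softmax is applied over the selected logits only.
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    if top_k > len(router_logits):
        raise ValueError("top_k cannot exceed the number of experts")
    # Indices of the top_k largest logits, highest first.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # made-up scores
    for expert, weight in route_token(logits):
        print(f"expert {expert:2d}  weight {weight:.3f}")
```

Under this kind of routing only 8 of the 64 expert blocks run for a given token, which is why the number of active parameters per token is much smaller than the total parameter count.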

## Available quantizations

| Quantization | Description | Size | KLD vs BF16 ↓ | Top-token match ↑ |
|---|---|---|---|---|
| [`BF16`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-BF16) | 16-bit, no quantization (reference) | 24.3 GB | — | — |
| [`Q8_0`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q8_0) | 8-bit, effectively lossless | 12.9 GB | 0.016 | 95.2% |
| [`Q6_K`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q6_K) | 6-bit k-quant, very high quality | 10.9 GB | 0.038 | 92.9% |
| **`Q4_K_M` (this repo)** | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.106 | 87.2% |
| [`MXFP4_MOE`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-MXFP4_MOE) | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.166 | 84.2% |

KL divergence (KLD) and top-token agreement are measured against the BF16 logits on
Wikitext-2 (`n_ctx=512`). Lower KLD and higher agreement both mean the quantized
model is closer to the unquantized one. Perplexity is omitted because it is
unreliable for instruction-tuned models on Wikitext-2, which is out of distribution for them.
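
As a rough illustration of what the two metrics mean, the sketch below computes them from per-position logits of a reference model and a quantized model. The exact tooling behind the table is not stated here, so treat this as an assumption-laden toy: the helper names, the use of natural-log units (nats), and the tiny example logits are all made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def compare(ref_positions, quant_positions):
    """Return (mean KLD, top-token agreement) averaged over all positions."""
    klds, matches = [], 0
    for ref, quant in zip(ref_positions, quant_positions):
        klds.append(kl_divergence(ref, quant))
        # Agreement: both models rank the same token first.
        matches += int(ref.index(max(ref)) == quant.index(max(quant)))
    n = len(klds)
    return sum(klds) / n, matches / n


if __name__ == "__main__":
    # Toy 3-token vocabulary, two positions (made-up numbers).
    reference = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    quantized = [[1.8, 0.7, -0.9], [0.3, 0.1, 2.5]]
    mean_kld, agreement = compare(reference, quantized)
    print(f"mean KLD: {mean_kld:.4f}  top-token agreement: {agreement:.1%}")
```

A KLD of 0 means the two distributions are identical at every position, and an agreement of 100% means the quantized model always picks the same most-likely token as the reference.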

## Download

```sh
hf download JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf --local-dir .
```
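
The same download can be scripted from Python with the `huggingface_hub` package. This is a minimal sketch that assumes `huggingface_hub` is installed; the wrapper name `download_gguf` is hypothetical, while the repository and file names are the ones listed above.

```python
REPO_ID = "JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M"
FILENAME = "Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf"


def download_gguf(local_dir="."):
    """Download the GGUF file into local_dir and return its local path."""
    # Imported lazily so this module can be loaded without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)


# Usage (fetches about 8.1 GB on first call):
#   path = download_gguf()
#   print(path)
```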

## Run with llama.cpp

```sh
# Pull and serve in one step (downloads the GGUF automatically).
# --ctx-size 131072 requests the full context window; lower it if memory is tight.
llama-server -hf JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M \
  --ctx-size 131072 \
  --temp 0.6 --top-p 0.95 --top-k 20

# Or run a one-off prompt with a local file
llama-cli -m Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf \
  --ctx-size 131072 \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  -p "Write a Python function to reverse a string."
```

The server exposes an OpenAI-compatible API on `http://localhost:8080/v1`:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="llama.cpp")

chat_response = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M",
    messages=[
        {"role": "user", "content": "Write a Python function to reverse a string."},
    ],
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20},
)
print(chat_response.choices[0].message.content)
```
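
Loading an 8.1 GB file can take a while, so a client started right after `llama-server` may hit the API before the model is ready. The sketch below polls the server's `/health` endpoint using only the standard library. The helper name `wait_until_ready` and the timeout values are assumptions chosen for illustration, and the URL assumes the default `localhost:8080` address used above.

```python
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # llama-server address used above


def wait_until_ready(url=HEALTH_URL, timeout_s=300.0, interval_s=2.0,
                     request_timeout_s=5.0):
    """Poll the health endpoint until it answers 200 or timeout_s elapses.

    Returns True once the server reports ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout_s) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts, and HTTP errors such as
            # 503 while the model is still loading.
            pass
        time.sleep(interval_s)
    return False


# Usage:
#   if wait_until_ready():
#       ...send chat requests as in the example above...
```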

## Run with Ollama

```sh
ollama run hf.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M
```
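
Once the model has been pulled, Ollama also serves a local HTTP API. The following sketch calls its `/api/chat` endpoint with the standard library and passes the same sampling values used in the llama.cpp examples. Several things here are assumptions: the default Ollama address `localhost:11434`, the helper names, and the `num_ctx` value of 8192, which is an arbitrary choice for the example rather than a recommendation from this card.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address
MODEL = "hf.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M"


def build_ollama_request(prompt, num_ctx=8192):
    """Build the JSON body for a single non-streaming chat turn."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream
        "options": {
            # Sampling values copied from the llama.cpp examples above.
            "temperature": 0.6,
            "top_p": 0.95,
            "top_k": 20,
            "num_ctx": num_ctx,  # arbitrary context size for this sketch
        },
    }


def ollama_chat(prompt, url=OLLAMA_URL, timeout_s=600):
    """Send one user message to Ollama and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_ollama_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)["message"]["content"]


# Usage (requires a running Ollama instance with the model pulled):
#   print(ollama_chat("Write a Python function to reverse a string."))
```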

## License

Released under the Apache 2.0 license.
