---
base_model: JetBrains/Mellum2-12B-A2.5B-Instruct
base_model_relation: quantized
library_name: gguf
pipeline_tag: text-generation
language:
- en
tags:
- mellum
- gguf
- llama.cpp
- quantized
- moe
- instruct
license: apache-2.0
---

# Mellum2 Instruct — GGUF (Q4_K_M)

This repository contains a **GGUF Q4_K_M** quantization of [`JetBrains/Mellum2-12B-A2.5B-Instruct`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct), ready to run with [`llama.cpp`](https://github.com/ggml-org/llama.cpp), Ollama, LM Studio, and other GGUF-compatible runtimes.

**This quantization (Q4_K_M):** 4-bit k-quant (medium). Strong quality/size trade-off (KLD ~0.106, 87% top-token agreement), which makes it a good default.

| File | Size |
|---|---|
| `Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf` | 8.1 GB |

Mellum 2 Instruct is a Mixture-of-Experts assistant model (64 experts, 8 activated per token, 131,072-token context) that answers directly, without an externalized chain of thought.

For the full model description, evaluation results, and architecture details, see the original model card: **[JetBrains/Mellum2-12B-A2.5B-Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct)**.

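The KLD and top-token agreement figures quoted for this quantization both compare the quantized model's next-token distribution with the BF16 model's at each position. The sketch below illustrates the two metrics in plain Python. It is not the code used to produce the numbers in this card: the logits are made up, the helper names are hypothetical, and the direction KL(BF16 || quantized) in nats is an assumption.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def top_token_match(ref_logits, quant_logits):
    """True when both models rank the same token first."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)
    return argmax(ref_logits) == argmax(quant_logits)


# Made-up logits over a 4-token vocabulary at two positions (illustration only).
ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.4, 3.0, 0.0]]
quant = [[1.8, 1.1, 0.0, -0.9], [0.6, 3.1, 2.9, 0.1]]

# Average both metrics over all positions, as a corpus-level score would.
mean_kld = sum(kl_divergence(r, q) for r, q in zip(ref, quant)) / len(ref)
agreement = sum(top_token_match(r, q) for r, q in zip(ref, quant)) / len(ref)
print(f"mean KLD: {mean_kld:.4f}, top-token agreement: {agreement:.0%}")
```

A KLD of 0 means the two distributions are identical at that position, and agreement counts only whether the most likely token is the same, so the two numbers can move independently.
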
## Available quantizations

| Quantization | Description | Size | KLD vs BF16 ↓ | Top-token match ↑ |
|---|---|---|---|---|
| [`BF16`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-BF16) | 16-bit, no quantization (reference) | 24.3 GB | — | — |
| [`Q8_0`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q8_0) | 8-bit, effectively lossless | 12.9 GB | 0.016 | 95.2% |
| [`Q6_K`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q6_K) | 6-bit k-quant, very high quality | 10.9 GB | 0.038 | 92.9% |
| **`Q4_K_M` (this repo)** | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.106 | 87.2% |
| [`MXFP4_MOE`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-MXFP4_MOE) | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.166 | 84.2% |

KL divergence and top-token agreement are measured against the BF16 logits on Wikitext-2 (`n_ctx=512`); lower KLD / higher agreement means closer to the unquantized model. (Perplexity is omitted here because it is unreliable for instruction-tuned models on Wikitext-2, which is out of distribution.)

## Download

```sh
hf download JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf --local-dir .
```

## Run with llama.cpp

```sh
# Pull and serve in one step (downloads the GGUF automatically)
llama-server -hf JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M \
  --ctx-size 131072 \
  --temp 0.6 --top-p 0.95 --top-k 20

# Or run a one-off prompt with a local file
llama-cli -m Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf \
  --ctx-size 131072 \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  -p "Write a Python function to reverse a string."
```

The server exposes an OpenAI-compatible API on `http://localhost:8080/v1`:

```python
from openai import OpenAI

# llama-server only checks the API key if it was started with --api-key,
# so any placeholder string works here.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="llama.cpp")

chat_response = client.chat.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M",
    messages=[
        {"role": "user", "content": "Write a Python function to reverse a string."},
    ],
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    # top_k is not a standard OpenAI parameter, so it is passed via extra_body.
    extra_body={"top_k": 20},
)
print(chat_response.choices[0].message.content)
```

## Run with Ollama

```sh
ollama run hf.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M
```

## License

Released under the Apache 2.0 license.

---

*For the full model card, evaluation results, and architecture details, refer to the original model: [JetBrains/Mellum2-12B-A2.5B-Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct).*