pavlichenko committed
Commit 1236b41 · verified · 1 Parent(s): 9ddf6fd

Add README with usage and quantization quality metrics

Files changed (1)
  1. README.md +103 -0
README.md CHANGED
@@ -1,3 +1,106 @@
  ---
+ base_model: JetBrains/Mellum2-12B-A2.5B-Instruct
+ base_model_relation: quantized
+ library_name: gguf
+ pipeline_tag: text-generation
+ language:
+ - en
+ tags:
+ - mellum
+ - gguf
+ - llama.cpp
+ - quantized
+ - moe
+ - instruct
  license: apache-2.0
  ---
+
+ # Mellum2 Instruct — GGUF (Q4_K_M)
+
+ This repository contains a **GGUF Q4_K_M** quantization of
+ [`JetBrains/Mellum2-12B-A2.5B-Instruct`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct), ready to run with
+ [`llama.cpp`](https://github.com/ggml-org/llama.cpp), Ollama, LM Studio, and
+ other GGUF-compatible runtimes.
+
+ **This quantization (Q4_K_M):** 4-bit k-quant (medium). It offers a strong quality/size trade-off (KLD 0.106, 87.2% top-token agreement against BF16), which makes it a good default.
+
+ | File | Size |
+ |---|---|
+ | `Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf` | 8.1 GB |
+
+ Mellum 2 Instruct is a Mixture-of-Experts assistant model (64 experts, 8
+ activated per token, 131,072-token context) that answers directly, without an
+ externalized chain of thought. For the full model description, evaluation
+ results, and architecture details, see the original model card:
+ **[JetBrains/Mellum2-12B-A2.5B-Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct)**.
+
+ ## Available quantizations
+
+ | Quantization | Description | Size | KLD vs BF16 ↓ | Top-token match ↑ |
+ |---|---|---|---|---|
+ | [`BF16`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-BF16) | 16-bit, no quantization (reference) | 24.3 GB | — | — |
+ | [`Q8_0`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q8_0) | 8-bit, effectively lossless | 12.9 GB | 0.016 | 95.2% |
+ | [`Q6_K`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q6_K) | 6-bit k-quant, very high quality | 10.9 GB | 0.038 | 92.9% |
+ | **`Q4_K_M` (this repo)** | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.106 | 87.2% |
+ | [`MXFP4_MOE`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-MXFP4_MOE) | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.166 | 84.2% |
+
+ KL divergence (KLD) and top-token agreement are measured against the BF16 logits on
+ Wikitext-2 (`n_ctx=512`); lower KLD and higher agreement mean the quantization is
+ closer to the unquantized model. Perplexity is omitted here because it is unreliable
+ for instruction-tuned models on Wikitext-2, which is out of distribution for them.
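
As a rough illustration of the two metrics in the table, the sketch below computes the mean KL divergence and the top-token agreement from two sets of per-position logits. It is not the tool that produced the numbers above. The function names, the toy inputs, and the use of natural logarithms (nats) are assumptions made for this example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    ref_logp = log_softmax(ref_logits)
    quant_logp = log_softmax(quant_logits)
    return sum(math.exp(p) * (p - q) for p, q in zip(ref_logp, quant_logp))


def top_token_match(ref_logits, quant_logits):
    """True when both models rank the same token first."""
    ref_best = max(range(len(ref_logits)), key=ref_logits.__getitem__)
    quant_best = max(range(len(quant_logits)), key=quant_logits.__getitem__)
    return ref_best == quant_best


def compare(ref_rows, quant_rows):
    """Return (mean KLD, fraction of positions with the same top token)."""
    if len(ref_rows) != len(quant_rows) or not ref_rows:
        raise ValueError("expected two non-empty sequences of equal length")
    pairs = list(zip(ref_rows, quant_rows))
    mean_kld = sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)
    agreement = sum(top_token_match(r, q) for r, q in pairs) / len(pairs)
    return mean_kld, agreement
```

Calling `compare(ref_rows, quant_rows)` with the reference (BF16) logits first and the quantized logits second returns the two aggregate values; identical inputs give a KLD of zero and an agreement of 1.0.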
+
+ ## Download
+
+ ```sh
+ hf download JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf --local-dir .
+ ```
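
The same download can probably be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `hf_hub_download` function; the wrapper name `download_gguf` is hypothetical and exists only for this example.

```python
REPO_ID = "JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M"
FILENAME = "Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf"


def download_gguf(local_dir: str = ".") -> str:
    """Download the GGUF file into local_dir and return its local path.

    Hypothetical helper: huggingface_hub is imported lazily so that merely
    defining this function does not require the package to be installed.
    """
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)
```

Calling `download_gguf(".")` fetches roughly 8.1 GB, so it is worth checking free disk space first.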
+
+ ## Run with llama.cpp
+
+ ```sh
+ # Pull and serve in one step (downloads the GGUF automatically)
+ llama-server -hf JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M \
+   --ctx-size 131072 \
+   --temp 0.6 --top-p 0.95 --top-k 20
+
+ # Or run a one-off prompt with a local file
+ llama-cli -m Mellum2-12B-A2.5B-Instruct-Q4_K_M.gguf \
+   --ctx-size 131072 \
+   --temp 0.6 --top-p 0.95 --top-k 20 \
+   -p "Write a Python function to reverse a string."
+ ```
+
+ The server exposes an OpenAI-compatible API on `http://localhost:8080/v1`:
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8080/v1", api_key="llama.cpp")
+
+ chat_response = client.chat.completions.create(
+     model="JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M",
+     messages=[
+         {"role": "user", "content": "Write a Python function to reverse a string."},
+     ],
+     max_tokens=81920,
+     temperature=0.6,
+     top_p=0.95,
+     extra_body={"top_k": 20},
+ )
+ print(chat_response.choices[0].message.content)
+ ```
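
If installing the `openai` package is not an option, the same endpoint can be reached with the Python standard library. The sketch below is an illustration only: the helper names are hypothetical, the `max_tokens` default of 512 is an arbitrary choice for the example, and `top_k` is placed at the top level of the JSON body, which is where the client's `extra_body` fields end up.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"
MODEL = "JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M"


def build_chat_request(prompt, max_tokens=512):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # assumed default for this sketch
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request to a running server and return the assistant's text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With `llama-server` running, `send_chat_request(build_chat_request("Write a Python function to reverse a string."))` should return the reply text. Keeping request construction separate from sending makes the payload easy to inspect without a server.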
+
+ ## Run with Ollama
+
+ ```sh
+ ollama run hf.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M
+ ```
+
+ ## License
+
+ Released under the Apache 2.0 license.
+
+ ---
+
+ *For the full model card, evaluation results, and architecture details, refer to
+ the original model: [JetBrains/Mellum2-12B-A2.5B-Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct).*