How to use from
Docker Model Runner
docker model run hf.co/ji-farthing/Mellum2-12B-A2.5B-Instruct-ik-llama-GGUF:BF16
Quick Links

Mellum2 12B A2.5B Instruct GGUF for ik_llama

This repository contains GGUF conversions of JetBrains/Mellum2-12B-A2.5B-Instruct.

The files were converted with an ik_llama.cpp branch that adds Mellum2 architecture support and emits the Mellum sliding-window and RoPE/YARN metadata needed by GGUF runtimes.

These files are intended as persistent convenience artifacts for ik_llama reviewers and users. They should also run on current llama.cpp builds that support the Mellum architecture.

No performance or model-quality claims are made here.

Files

File Type SHA-256
Mellum2-12B-A2.5B-Instruct-ik-llama-BF16.gguf BF16 reference conversion 6a322a3f6c59cdd9b4eee3ea678d964572d4b3dc07e52965f235823013d352e0
Mellum2-12B-A2.5B-Instruct-ik-llama-Q8_0.gguf Q8_0 quantization a7db12ebf1e0567927b5a7433dafe98535fd3b75ead9e23f008f1219a6bc90bb

Provenance

The embedded chat template is the stock JetBrains Instruct template:

tokenizer.chat_template sha256 = e674cbec4c384ab50c18c91d8cada3b6931d7a7ee25d9db004366aa440c1ca86

The converted GGUF metadata includes:

  • mellum.attention.sliding_window = 1024
  • mellum.attention.sliding_window_pattern
  • mellum.rope.freq_base = 500000.0
  • mellum.rope.freq_base_swa = 500000.0
  • mellum.rope.scaling.type = yarn
  • mellum.rope.scaling.factor = 16.0
  • mellum.rope.scaling.original_context_length = 8192
  • mellum.rope.scaling.yarn_attn_factor = 1.2772588729858398
  • mellum.rope.scaling.yarn_beta_fast = 32.0
  • mellum.rope.scaling.yarn_beta_slow = 1.0

Local Validation

The BF16 and Q8_0 files were smoke-tested locally on an RTX 4070 with CUDA server builds.

Validation included:

  • Q8_0 with ik_llama.cpp CUDA server and --cpu-moe
  • Q8_0 with current llama.cpp upstream CUDA server and --cpu-moe
  • BF16 with current llama.cpp upstream CUDA server and --cpu-moe
  • OpenAI-compatible chat completion request using the embedded chat template
  • deterministic long-code prompt
  • python3 -m py_compile on the extracted code
  • functional topological-sort test including cycle detection

The long-code smoke is a runtime sanity check only. It is not a benchmark and does not imply any quality ranking.

Example

./llama-server \
  -m Mellum2-12B-A2.5B-Instruct-ik-llama-Q8_0.gguf \
  -ngl 99 \
  --cpu-moe \
  -c 4096 \
  -b 512 \
  -ub 512 \
  --jinja

License

The source model card lists the license as Apache-2.0. See the upstream JetBrains model card for the authoritative license and model documentation.

Downloads last month
111
GGUF
Model size
12B params
Architecture
mellum
Hardware compatibility
Log In to add your hardware

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for ji-farthing/Mellum2-12B-A2.5B-Instruct-ik-llama-GGUF

Quantized
(16)
this model