Mellum2-12B-A2.5B-Instruct-mlx-8bit

This is an 8-bit MLX quantization of JetBrains/Mellum2-12B-A2.5B-Instruct, the instruction-tuned Mixture-of-Experts coding assistant from JetBrains. It is derived from the full-precision jedisct1/Mellum2-12B-A2.5B-Instruct-mlx conversion.

Every weight is quantized to 8 bits with a group size of 64 (about 8.5 bits per weight overall). At 8 bits the output is effectively indistinguishable from the bfloat16 model, so this is the quantization to reach for when you want the original quality at roughly half the memory.
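The 8.5 bits-per-weight figure can be reproduced with a little arithmetic. The sketch below assumes that each group of 64 weights shares one 16-bit scale and one 16-bit bias, as in MLX's affine quantization. That overhead, and treating all 12B parameters as quantized, are simplifying assumptions rather than measurements of this checkpoint.

```python
# Back-of-the-envelope estimate of quantized storage cost.
# Assumption: each group stores one 16-bit scale and one 16-bit bias
# alongside its quantized weights (not measured from this checkpoint).


def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Quantized bits plus the per-group overhead amortized over the group."""
    return bits + (scale_bits + bias_bits) / group_size


def weights_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 12e9  # nominal parameter count taken from the model name

bpw_8bit = effective_bits_per_weight(bits=8, group_size=64)
size_8bit = weights_gigabytes(NUM_PARAMS, bpw_8bit)
size_bf16 = weights_gigabytes(NUM_PARAMS, 16)

print(f"effective bits per weight: {bpw_8bit}")
print(f"8-bit weights: ~{size_8bit:.2f} GB")
print(f"bfloat16 weights: ~{size_bf16:.2f} GB")
print(f"ratio: {size_8bit / size_bf16:.2f}")
```

Under these assumptions the 8-bit weights come to a little over half the bfloat16 size, which is where "roughly half the memory" comes from.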

Unlike its sibling Thinking model, the Instruct model answers directly without a <think> reasoning block. Mellum 2 uses 64 experts with 8 active per token (about 2.5B active parameters out of 12B), a mix of sliding-window and full-attention layers, and a 131,072-token context window.

Tool calling was verified end to end against a live mlx_lm.server driven by the swival agent harness, run side by side with the full-precision model. The 8-bit weights matched the full-precision model exactly: every read_file, edit_file, write_file, list_files, and shell-command call was well-formed. Generation stops cleanly on <|im_end|>; the eos_token_id is set to [0, 28], which is what lets agent harnesses see a proper tool_calls finish reason.
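To make the finish-reason point concrete, here is a minimal sketch of how a client might pull tool calls out of a chat-completion response. It assumes the server replies in the OpenAI-style chat-completions shape (choices, finish_reason, message.tool_calls). The `extract_tool_calls` helper, the sample response, and the `path` argument name are hypothetical illustrations, not output captured from this model.

```python
import json


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (tool_name, arguments) pairs when the model stopped to call tools.

    Assumes an OpenAI-style chat-completion response body.
    """
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []  # plain text answer, nothing to dispatch
    calls = []
    for call in choice["message"].get("tool_calls") or []:
        function = call["function"]
        # In this format the arguments arrive as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hypothetical response, hand-written for illustration only.
sample_response = {
    "choices": [
        {
            "finish_reason": "tool_calls",
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "read_file",
                            "arguments": json.dumps({"path": "README.md"}),
                        },
                    }
                ],
            },
        }
    ]
}

for name, arguments in extract_tool_calls(sample_response):
    print(name, arguments)
```

A harness that keys off finish_reason this way only dispatches tools when generation ended on a proper stop token, which is why the eos_token_id setting matters.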

Requirements

The mellum architecture is not supported by the stock mlx-lm code yet.

Until it is supported upstream, install this fork of mlx-lm from source:

pip install git+https://github.com/jedisct1/mlx-lm

Or run it directly with uv:

uvx --from git+https://github.com/jedisct1/mlx-lm mlx_lm.server

Use with mlx-lm

Quick test:

uvx --from git+https://github.com/jedisct1/mlx-lm \
  mlx_lm.generate --model jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit \
  --prompt "Write a Python function that reverses a linked list." \
  --max-tokens 16384 \
  --temp 0.6 --top-p 0.95 --top-k 20
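The same quick test can also be run from Python. This is a sketch that assumes the fork keeps the upstream mlx-lm Python API (`load`, `generate`, and `make_sampler` from `mlx_lm.sample_utils`); `build_messages` and `run_quick_test` are illustrative helper names, not part of mlx-lm.

```python
def build_messages(prompt: str) -> list[dict]:
    """Wrap a single user prompt in the chat-message format."""
    return [{"role": "user", "content": prompt}]


def run_quick_test(prompt: str, max_tokens: int = 16384) -> str:
    """Generate a reply with the sampling settings used on the command line."""
    # Imported lazily so this file can be imported without mlx-lm installed.
    # Assumption: the fork exposes the same API as upstream mlx-lm.
    from mlx_lm import generate, load
    from mlx_lm.sample_utils import make_sampler

    model, tokenizer = load("jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit")
    # Apply the model's chat template so the prompt matches its instruct format.
    chat_prompt = tokenizer.apply_chat_template(
        build_messages(prompt), add_generation_prompt=True
    )
    sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)
    return generate(
        model,
        tokenizer,
        prompt=chat_prompt,
        max_tokens=max_tokens,
        sampler=sampler,
    )
```

Calling `run_quick_test("Write a Python function that reverses a linked list.")` fetches the weights on first use and needs a machine that MLX supports.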

Starting the server:

uvx --from git+https://github.com/jedisct1/mlx-lm \
  mlx_lm.server --model jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit \
  --max-tokens 16384 \
  --temp 0.6 --top-p 0.95 --top-k 20

The recommended sampling settings from JetBrains are temperature=0.6, top_p=0.95, top_k=20.
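A client can also pass these sampling settings per request once the server is running. The sketch below uses only the Python standard library. It assumes the server listens on http://127.0.0.1:8080 and exposes an OpenAI-style /v1/chat/completions endpoint that accepts `temperature`, `top_p`, and `top_k` in the request body; `build_payload` and `chat` are illustrative names, so adjust them and the address to your setup.

```python
import json
import urllib.request

# Assumed address; change it if the server was started with --host/--port.
BASE_URL = "http://127.0.0.1:8080"
MODEL = "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"


def build_payload(prompt: str, max_tokens: int = 1024) -> dict:
    """Build a chat-completion request using the recommended sampling settings."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,  # assumed to be honored by the server
    }


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """Send one prompt to the server and return the assistant's text reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Write a Python function that reverses a linked list."))` prints the model's answer.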

Using this setup with the Swival.dev harness

Install swival:

uv tool install swival

Then point it at the running server:

swival --provider llamacpp --model jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit

License

Apache 2.0, inherited from the original model.
