Instructions for using jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit with libraries and local apps.

Libraries

How to use jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit with MLX:

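# Make sure mlx-lm is installed:
# pip install --upgrade mlx-lm
# Until the mellum architecture is supported upstream, install the fork instead:
# pip install git+https://github.com/jedisct1/mlx-lm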

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit with Pi:

Start the MLX server

# Install MLX LM (stock release; see Requirements below for the fork that supports the mellum architecture):
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"
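
The server may need some time to load the weights before it accepts requests, so a client started immediately can hit a connection error. The sketch below polls until the server answers. It assumes the server exposes the OpenAI-style `GET /v1/models` route on port 8080; `http_probe` and `wait_for_server` are hypothetical helper names, not part of mlx-lm.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_for_server(
    url: str = "http://localhost:8080/v1/models",  # assumed route and port
    attempts: int = 30,
    delay: float = 1.0,
    probe=http_probe,
) -> bool:
    """Poll `url` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False


# Example (requires the server to be starting or running):
# if not wait_for_server():
#     raise SystemExit("mlx_lm.server did not come up in time")
```

The `probe` parameter exists only so the polling loop can be exercised without a live server.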

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"
        }
      ]
    }
  }
}
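
If `~/.pi/agent/models.json` already lists other providers, pasting the snippet over the file would discard them. A minimal sketch of merging the entry instead, assuming the file layout shown above; `add_mlx_provider` is a hypothetical helper and not part of Pi.

```python
import json
from pathlib import Path

MODEL_ID = "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"


def add_mlx_provider(path: Path, model_id: str = MODEL_ID) -> dict:
    """Add or overwrite the "mlx-lm" provider, keeping any other providers."""
    config = json.loads(path.read_text()) if path.exists() else {}
    providers = config.setdefault("providers", {})
    providers["mlx-lm"] = {
        "baseUrl": "http://localhost:8080/v1",
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Example:
# add_mlx_provider(Path.home() / ".pi" / "agent" / "models.json")
```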

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit with Hermes Agent:

Start the MLX server

# Install MLX LM (stock release; see Requirements below for the fork that supports the mellum architecture):
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit

Run Hermes

hermes

MLX LM

How to use jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit with MLX LM:

Generate or start a chat session

# Install MLX LM (stock release; see Requirements below for the fork that supports the mellum architecture)
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
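
The same request can be sent from Python with only the standard library. This sketch assumes the server is listening on port 8080, the mlx_lm.server default and the port used in the Pi and Hermes configurations above; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

MODEL_ID = "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"
BASE_URL = "http://localhost:8080/v1"  # assumed default host and port


def build_chat_request(prompt: str, base_url: str = BASE_URL, model: str = MODEL_ID):
    """Build a POST request for the /chat/completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(chat("Hello"))
```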

Mellum2-12B-A2.5B-Instruct-mlx-8bit

This is an 8-bit MLX quantization of JetBrains/Mellum2-12B-A2.5B-Instruct, the instruction-tuned Mixture-of-Experts coding assistant from JetBrains. It is derived from the full-precision jedisct1/Mellum2-12B-A2.5B-Instruct-mlx conversion.

Every weight is quantized to 8 bits with a group size of 64 (about 8.5 bits per weight overall). At 8 bits the output is effectively indistinguishable from the bfloat16 model, so this is the quantization to reach for when you want the original quality at roughly half the memory.
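
The 8.5 figure follows from the per-group overhead. A back-of-the-envelope check, assuming each group of 64 weights stores one 16-bit scale and one 16-bit bias next to the 8-bit values (the usual layout for MLX affine quantization), treating "12B" as exactly 12e9 parameters, and ignoring any tensors kept in higher precision:

```python
def bits_per_weight(bits: int, group_size: int,
                    scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective bits per weight, including per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def weight_gigabytes(n_params: float, bpw: float) -> float:
    """Approximate weight storage in GB (10**9 bytes)."""
    return n_params * bpw / 8 / 1e9


N_PARAMS = 12e9  # nominal parameter count, an approximation

bpw = bits_per_weight(8, 64)
q8_gb = weight_gigabytes(N_PARAMS, bpw)
bf16_gb = weight_gigabytes(N_PARAMS, 16)

print(f"bits per weight: {bpw}")
print(f"8-bit weights:   ~{q8_gb:.2f} GB")
print(f"bfloat16:        ~{bf16_gb:.2f} GB")
print(f"ratio:           {q8_gb / bf16_gb:.2f}")
```

Under these assumptions the quantized weights come to a little over half the bfloat16 size, which matches the "roughly half the memory" estimate.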

Unlike its sibling Thinking model, the Instruct model answers directly without a <think> reasoning block. Mellum 2 uses 64 experts with 8 active per token (about 2.5B active parameters out of 12B), a mix of sliding-window and full-attention layers, and a 131,072-token context window.
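
To make "8 active out of 64" concrete, here is a generic top-k routing sketch in plain Python. It is not Mellum's actual router: the softmax over the selected scores, the toy scores, and the `route` name are all assumptions made for illustration.

```python
import math


def route(logits: list[float], k: int = 8) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]  # shifted for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Toy router scores for 64 experts (made-up values, one per expert).
logits = [((i * 37) % 64) / 10 for i in range(64)]

chosen = route(logits)
for expert, weight in chosen:
    print(f"expert {expert:2d}  weight {weight:.3f}")
```

Only the selected experts run for a given token, which is why the active parameter count per token is far below the total.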

Tool calling was verified end to end against a live mlx_lm.server driven by the swival agent harness, run side by side with the full-precision model. The 8-bit weights matched it exactly, issuing well-formed read_file, edit_file, write_file, list_files, and shell-command calls, none of them malformed. Generation stops cleanly on <|im_end|>: the eos_token_id is set to [0, 28], which is what lets agent harnesses see a proper tool_calls finish reason.
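
On the client side, a harness typically branches on that finish reason. A minimal sketch, assuming the server returns the OpenAI chat-completions response shape. The `sample` response is hand-written for illustration (it is not captured output), the `path` argument name is a guess, and `extract_tool_calls` is a hypothetical helper.

```python
import json


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (tool name, parsed arguments) pairs, or [] for a normal reply."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    calls = []
    for call in choice["message"].get("tool_calls", []):
        fn = call["function"]
        # In the OpenAI schema, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written example response, for illustration only.
sample = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "role": "assistant",
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "read_file",
                    "arguments": '{"path": "README.md"}',
                },
            }],
        },
    }]
}

print(extract_tool_calls(sample))
```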

Requirements

The mellum architecture is not supported by the stock mlx-lm code yet.

Until it is supported upstream, install this fork of mlx-lm from source:

pip install git+https://github.com/jedisct1/mlx-lm

Or run it directly with uv:

uvx --from git+https://github.com/jedisct1/mlx-lm mlx_lm.server

Use with mlx-lm

Quick test:

uvx --from git+https://github.com/jedisct1/mlx-lm \
  mlx_lm.generate --model jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit \
  --prompt "Write a Python function that reverses a linked list." \
  --max-tokens 16384 \
  --temp 0.6 --top-p 0.95 --top-k 20

Starting the server:

uvx --from git+https://github.com/jedisct1/mlx-lm \
  mlx_lm.server --model jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit \
  --max-tokens 16384 \
  --temp 0.6 --top-p 0.95 --top-k 20

The recommended sampling settings from JetBrains are temperature=0.6, top_p=0.95, top_k=20.
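
As a rough illustration of what these three settings do, the following pure-Python sketch applies temperature scaling, then top-k, then top-p (nucleus) filtering to a toy distribution. It is a textbook formulation, not the mlx-lm implementation; the order of the filters is an assumption and `filter_probs` is a hypothetical name.

```python
import math


def filter_probs(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return {token index: probability} for the tokens that survive filtering."""
    # Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs, key=probs.get, reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for five tokens
print(filter_probs(toy_logits))
```

A token is then drawn at random from the surviving, renormalized probabilities.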

Using this setup with the Swival.dev harness

Install swival.dev:

uv tool install swival

Then point it at the running server:

swival --provider llamacpp --model jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit

License

Apache 2.0, inherited from the original model.

