Instructions to use jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"
        }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory: pi
- Hermes Agent
How to use jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit
Run Hermes
hermes
- MLX LM
How to use jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"

# Call the OpenAI-compatible server with curl (same port 8080 as the Pi and Hermes configurations)
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
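The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server is already running locally on port 8080 (the port used in the Pi and Hermes configurations above) and that it returns an OpenAI-style response body; the helper names `build_chat_request` and `chat` are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: the server listens on localhost:8080, as in the Pi/Hermes configs.
BASE_URL = "http://localhost:8080/v1"
MODEL = "jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    # Assumes the OpenAI-style layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Building the request is kept separate from sending it so the payload can be inspected without a live server.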
Mellum2-12B-A2.5B-Instruct-mlx-8bit
This is an 8-bit MLX quantization of
JetBrains/Mellum2-12B-A2.5B-Instruct,
the instruction-tuned Mixture-of-Experts coding assistant from JetBrains. It is derived from the
full-precision jedisct1/Mellum2-12B-A2.5B-Instruct-mlx
conversion.
Every weight is quantized to 8 bits with a group size of 64 (about 8.5 bits per weight overall).
At 8 bits the output is effectively indistinguishable from the bfloat16 model, so this is the
quantization to reach for when you want the original quality at roughly half the memory.
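The 8.5 bits-per-weight figure and the "roughly half the memory" claim can be checked with a little arithmetic. The sketch below assumes that each group of 64 weights carries one 16-bit scale and one 16-bit bias, and it uses the nominal 12B parameter count; both are assumptions for illustration, and real file sizes will differ somewhat.

```python
def bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Effective storage per weight for group-wise affine quantization.

    Assumption: every group stores one scale and one bias alongside
    its quantized weights, each at 16 bits.
    """
    return bits + (scale_bits + bias_bits) / group_size


def weight_memory_gb(num_params, bpw):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bpw / 8 / 1e9


NUM_PARAMS = 12e9  # nominal total parameter count, taken from the model name

bpw = bits_per_weight(bits=8, group_size=64)
print(f"{bpw} bits per weight")
print(f"8-bit: ~{weight_memory_gb(NUM_PARAMS, bpw):.2f} GB")
print(f"bf16:  ~{weight_memory_gb(NUM_PARAMS, 16):.2f} GB")
```

Under these assumptions the per-group overhead is 32/64 = 0.5 bits, giving 8.5 bits per weight, or about 12.75 GB of weights against about 24 GB for bfloat16.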
Unlike its sibling Thinking model, the Instruct model answers directly without a <think>
reasoning block. Mellum 2 uses 64 experts with 8 active per token (about 2.5B active parameters
out of 12B), a mix of sliding-window and full-attention layers, and a 131,072-token context
window.
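To make "64 experts with 8 active per token" concrete, here is a toy sketch of generic top-k expert routing. It is not taken from the Mellum 2 implementation: the `route` function and the choice to apply a softmax over the selected logits are assumptions for illustration, and the real router may normalize differently.

```python
import math
import random

NUM_EXPERTS = 64  # from the model description
TOP_K = 8         # experts active per token


def route(router_logits, top_k=TOP_K):
    """Pick the top_k experts and return {expert_index: normalized weight}."""
    order = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    top = order[:top_k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Random logits stand in for the output of a learned router layer.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route(logits)
print(len(weights), "experts active out of", NUM_EXPERTS)
```

Only the selected experts run for a given token, which is why the active parameter count (about 2.5B) is far below the total (12B).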
Tool calling was verified end to end against a live mlx_lm.server driven by the swival agent
harness, run side by side with the full-precision model. The 8-bit weights matched it exactly,
issuing read_file, edit_file, write_file, list_files, and shell-command calls, none of them
malformed. Generation stops cleanly on <|im_end|>: the eos_token_id is set to
[0, 28], which is what lets agent harnesses see a proper tool_calls finish reason.
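On the client side, an agent harness typically checks the finish reason before dispatching tools. The sketch below shows that check against a hand-written response in the OpenAI chat-completion shape. The sample response, the `path` argument, and the `extract_tool_calls` helper are all hypothetical; only the `read_file` tool name and the `tool_calls` finish reason come from the text above.

```python
import json

# Hypothetical response, shaped like an OpenAI-style chat completion.
SAMPLE_RESPONSE = {
    "choices": [
        {
            "finish_reason": "tool_calls",
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "read_file",
                            # Hypothetical argument name, for illustration only.
                            "arguments": '{"path": "README.md"}',
                        },
                    }
                ],
            },
        }
    ]
}


def extract_tool_calls(response):
    """Return [(name, arguments_dict)] if the model stopped to call tools, else []."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    calls = []
    for call in choice["message"].get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string; malformed JSON raises here.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


print(extract_tool_calls(SAMPLE_RESPONSE))
```

If generation did not stop cleanly, a harness written this way would see a different finish reason (for example a length cutoff) and never dispatch the call.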
Requirements
The mellum architecture is not supported by the stock mlx-lm code yet.
Until it is supported upstream, install this fork of mlx-lm from source:
pip install git+https://github.com/jedisct1/mlx-lm
Or run it directly with uv:
uvx --from git+https://github.com/jedisct1/mlx-lm mlx_lm.server
Use with mlx-lm
Quick test:
uvx --from git+https://github.com/jedisct1/mlx-lm \
mlx_lm.generate --model jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit \
--prompt "Write a Python function that reverses a linked list." \
--max-tokens 16384 \
--temp 0.6 --top-p 0.95 --top-k 20
Starting the server:
uvx --from git+https://github.com/jedisct1/mlx-lm \
mlx_lm.server --model jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit \
--max-tokens 16384 \
--temp 0.6 --top-p 0.95 --top-k 20
The recommended sampling settings from JetBrains are temperature=0.6, top_p=0.95, top_k=20.
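For readers unfamiliar with these three settings, the toy sketch below shows what each one does to a next-token distribution. It is a generic illustration in plain Python, not mlx-lm's sampler: the function names, the order in which the filters are applied, and the toy logits are all assumptions.

```python
import math
import random


def filter_distribution(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # Top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize the survivors so they form a proper distribution.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


def sample(logits, rng=random, **settings):
    """Draw one token from the filtered distribution."""
    dist = filter_distribution(logits, **settings)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy logits for four candidate tokens (made up for illustration).
toy_logits = {"def": 3.0, "class": 2.0, "import": 1.0, "lambda": -2.0}
print(filter_distribution(toy_logits))
```

With these toy logits only `def` and `class` survive: together they already cover more than 95% of the probability mass after temperature scaling, so the top-p cutoff drops the rest.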
Using this setup with the Swival.dev harness
Install swival.dev:
uv tool install swival
Then point it at the running server:
swival --provider llamacpp --model jedisct1/Mellum2-12B-A2.5B-Instruct-mlx-8bit
License
Apache 2.0, inherited from the original model.