Instructions to use clglavan/magos-k8s-0.6b with libraries, inference providers, notebooks, and local apps.

Libraries

How to use clglavan/magos-k8s-0.6b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="clglavan/magos-k8s-0.6b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("clglavan/magos-k8s-0.6b")
model = AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use clglavan/magos-k8s-0.6b with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="clglavan/magos-k8s-0.6b",
	filename="magos-k8s-0.6b-f16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)


llama.cpp

How to use clglavan/magos-k8s-0.6b with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf clglavan/magos-k8s-0.6b:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf clglavan/magos-k8s-0.6b:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf clglavan/magos-k8s-0.6b:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf clglavan/magos-k8s-0.6b:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf clglavan/magos-k8s-0.6b:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf clglavan/magos-k8s-0.6b:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf clglavan/magos-k8s-0.6b:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf clglavan/magos-k8s-0.6b:Q4_K_M

Use Docker

docker model run hf.co/clglavan/magos-k8s-0.6b:Q4_K_M

vLLM

How to use clglavan/magos-k8s-0.6b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "clglavan/magos-k8s-0.6b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "clglavan/magos-k8s-0.6b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
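
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative, and `temperature=0.0` is chosen here to match the greedy-decoding advice given in the model card further down. It assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route shown in the curl example.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # greedy decoding
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant message text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the vLLM server running, `chat("http://localhost:8000", "clglavan/magos-k8s-0.6b", "Why is my pod stuck in Pending?")` should return the model's reply. The same helper should work against the other OpenAI-compatible servers on this page by changing `base_url`.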

SGLang

How to use clglavan/magos-k8s-0.6b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "clglavan/magos-k8s-0.6b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "clglavan/magos-k8s-0.6b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "clglavan/magos-k8s-0.6b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "clglavan/magos-k8s-0.6b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use clglavan/magos-k8s-0.6b with Ollama:
```
ollama run hf.co/clglavan/magos-k8s-0.6b:Q4_K_M
```

Unsloth Studio

How to use clglavan/magos-k8s-0.6b with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for clglavan/magos-k8s-0.6b to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for clglavan/magos-k8s-0.6b to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for clglavan/magos-k8s-0.6b to start chatting

Pi

How to use clglavan/magos-k8s-0.6b with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf clglavan/magos-k8s-0.6b:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "clglavan/magos-k8s-0.6b:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use clglavan/magos-k8s-0.6b with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf clglavan/magos-k8s-0.6b:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default clglavan/magos-k8s-0.6b:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use clglavan/magos-k8s-0.6b with Docker Model Runner:
```
docker model run hf.co/clglavan/magos-k8s-0.6b:Q4_K_M
```

Lemonade

How to use clglavan/magos-k8s-0.6b with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull clglavan/magos-k8s-0.6b:Q4_K_M

Run and chat with the model

lemonade run user.magos-k8s-0.6b-Q4_K_M

List all available models

lemonade list

magos-k8s-0.6b

magos-k8s-0.6b is a 0.6B-parameter reasoning model for Kubernetes diagnostics, derived from Qwen3-0.6B. It is trained in two full-weight stages: continued pre-training (CPT) on Kubernetes documentation, the v1.34 API reference (every resource Kind), the kubectl command reference, and Prometheus alert runbooks; followed by supervised fine-tuning (SFT) on event→YAML diagnostic pairs. Each response is a structured <think> reasoning trace followed by a concise answer — a kubectl/promtool command, a YAML patch, or a root cause plus fix.

Scope and design

The model targets a narrow task: mapping a Kubernetes symptom (a failed or Warning condition, a kubectl describe/events excerpt, a misconfigured manifest) to the responsible spec field and the corrective action. The reasoning trace is intentionally short and templated (implicated condition → spec field → verdict → fix / next command) rather than open-ended chain-of-thought — that is the form a 0.6B model reproduces reliably without drifting into invented detail.

Because every response terminates in a concrete next action, the model fits as the inner-loop reasoner of a planner→executor devops agent. It is full-weight fine-tuned (no LoRA/adapters), ships as bf16 safetensors plus GGUF quantizations, and runs locally at ~640 MB (Q8). Knowledge is frozen at the training snapshot; treat the model as a reasoning component, not a source of truth, and verify field/flag specifics against current docs or a live kubectl explain.
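
An agent that consumes these responses needs to separate the reasoning trace from the actionable answer. Below is a minimal sketch of that step, assuming the output contains at most one `<think>...</think>` block followed by the answer, as described above. The names `ParsedResponse` and `split_response` are illustrative and not part of any published tooling for this model.

```python
import re
from typing import NamedTuple


class ParsedResponse(NamedTuple):
    reasoning: str  # text inside <think>...</think>, may be empty
    answer: str     # everything after the closing </think>


_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(text: str) -> ParsedResponse:
    """Separate the <think> trace from the final answer."""
    match = _THINK_RE.search(text)
    if match is None:
        # No reasoning block found: treat the whole output as the answer.
        return ParsedResponse(reasoning="", answer=text.strip())
    return ParsedResponse(
        reasoning=match.group(1).strip(),
        answer=text[match.end():].strip(),
    )
```

An executor would then act only on `answer`, keeping `reasoning` for logging or for the planner to inspect.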

What's new in v16 (current stable)

v16 is the largest and broadest corpus yet — ~108k <think> reasoning examples, all derived from the official Kubernetes sources and built so the model only ever phrases scenarios around verified facts (every YAML field is checked against the v1.34 OpenAPI schema; every flag against the kubectl reference). It combines two tracks:

Event-grounded diagnostic matched pairs (the v15 design): a BROKEN case (failed/Warning events ↔ the exact offending YAML field) and a HEALTHY case (clean events ↔ the same field set correctly), across ~80 failure subcategories (scheduling, image, crashloop, probes, volumes, networking, RBAC/PodSecurity, controllers, quota/limits, …).
Command-reference: correct kubectl invocations across ~45 subcommands and their flags.

Every answer is a short, structured <think> chain (events → correlate to field → verdict → fix, or goal → command) followed by a concise YAML patch or command — the form a 0.6B model reproduces reliably without drifting into invented detail.
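
To make the matched-pair idea concrete, here is a hypothetical sketch of how one BROKEN/HEALTHY pair could be represented. The `DiagnosticCase` type, its field names, and the sample values are assumptions made for illustration. They are not the actual training-data schema and are not drawn from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticCase:
    label: str        # "BROKEN" or "HEALTHY"
    events: str       # excerpt of kubectl events / describe output
    field_path: str   # spec field the events are correlated to
    field_value: str  # value of that field in this case


def is_matched_pair(a: DiagnosticCase, b: DiagnosticCase) -> bool:
    """True when the two cases share a field but differ in label and value."""
    return (
        a.field_path == b.field_path
        and {a.label, b.label} == {"BROKEN", "HEALTHY"}
        and a.field_value != b.field_value
    )


# Illustrative values only, not taken from the training set.
broken = DiagnosticCase(
    label="BROKEN",
    events="Warning  FailedScheduling  0/3 nodes are available: 3 Insufficient memory.",
    field_path="spec.containers[0].resources.requests.memory",
    field_value="64Gi",
)
healthy = DiagnosticCase(
    label="HEALTHY",
    events="Normal  Scheduled  Successfully assigned default/web to node-1",
    field_path="spec.containers[0].resources.requests.memory",
    field_value="256Mi",
)
```

Pairing the same field with opposite outcomes is what lets a model learn that the events, and not the mere presence of the field, determine the verdict.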

	v15	v16
Corpus	~16.6k diagnostic	~108k (diagnostic + command-reference)
Coverage	~80 diagnostic subcategories	+ ~45 kubectl subcommands/flags
Recipe	4 epochs · LR 2e-5 · batch 32	4 epochs · LR 2e-5 · batch 32

Strengths: diagnosing from pasted events/describe output, YAML generation/review, and structured next-step reasoning. It is full-weight fine-tuned (no LoRA), schema-grounded, and low-hallucination by construction.

To pin a specific version when loading:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b", revision="v16")
# or revision="v15" / "v8" / "v7" / "v6" / "v5" / "v3" / "v2" for previous versions

What it's good at

Diagnosing from events — paste kubectl get events / kubectl describe output and it correlates the failure to the responsible YAML field + fix.
YAML manifest generation and review — a top strength; correct apiVersion/field names across Pod, Deployment, Service, NetworkPolicy, PVC, HPA, Ingress, RBAC and many other Kinds (schema-validated training set).
kubectl command construction — broad subcommand/flag coverage from the reference (the v16 command-reference track).
Prometheus alert handling — meaning + diagnostic steps for the prometheus-operator runbook set.
Structured next-step reasoning — short <think> that ends in a concrete command or fix, suitable as an agent's inner-loop reasoner.
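
For the "diagnosing from events" use case, the pasted output has to be turned into a chat message. The sketch below is one possible way to do that, assuming a prompt phrased like the usage example later in this card ("kubectl describe pod shows: ... Why?"). The helper name `build_messages` and the `max_lines` cutoff are illustrative choices, not a format the model is documented to require.

```python
def build_messages(describe_output: str, max_lines: int = 40) -> list[dict[str, str]]:
    """Wrap a kubectl describe/events excerpt in a single user message."""
    lines = [ln.strip() for ln in describe_output.splitlines() if ln.strip()]
    # The Events section sits at the end of `kubectl describe` output,
    # so keeping the trailing lines preserves it when the input is long.
    excerpt = "\n".join(lines[-max_lines:])
    return [{"role": "user", "content": f"kubectl describe pod shows:\n{excerpt}\nWhy?"}]
```

The returned list can be passed as `messages` to any of the chat APIs shown on this page.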

What it's not good at

Multi-step planning or complex tool chains — it's a 0.6B model.
Subtle/rare flags and multi-flag combinations — verify with kubectl --help.
General (non-Kubernetes) reasoning — this corpus is K8s-focused.
Knowledge of features released after the source docs were captured (mid-2026).

How to use

Important — sampling: v16 is a reasoning model. Run it greedy with repetition_penalty = 1.0. A repetition penalty > 1.0 penalizes the prompt words the <think> block needs to reference and collapses it to an empty <think></think>. (This differs from the terse v8, which used temp 0.05 / rep 1.15.)
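
The effect described above can be illustrated with a toy calculation. This sketch assumes the common CTRL-style rule (the one implemented by `RepetitionPenaltyLogitsProcessor` in transformers), where the logit of every token already present in the context, prompt included, is divided by the penalty if positive and multiplied by it if negative. The logit values are invented for illustration and `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(
    logits: dict[str, float], seen: set[str], penalty: float
) -> dict[str, float]:
    """Penalize tokens that already appear in the context (CTRL-style rule)."""
    adjusted = {}
    for token, score in logits.items():
        if token in seen:
            score = score / penalty if score > 0 else score * penalty
        adjusted[token] = score
    return adjusted


# Toy next-token logits: the model wants to quote "memory" from the prompt.
toy_logits = {"memory": 3.0, "</think>": 2.8}
prompt_tokens = {"memory"}

greedy = apply_repetition_penalty(toy_logits, prompt_tokens, 1.0)
penalized = apply_repetition_penalty(toy_logits, prompt_tokens, 1.15)

print(max(greedy, key=greedy.get))        # memory
print(max(penalized, key=penalized.get))  # </think>
```

With the penalty at 1.0 the prompt word stays the top choice. At 1.15 its logit drops to about 2.61 and the closing tag wins, which is the early-termination behaviour the note warns about.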

llama.cpp / Ollama / LM Studio

File	Size	Quality
`magos-k8s-0.6b-f16.gguf`	~1.2 GB	reference (full precision)
`magos-k8s-0.6b-q8_0.gguf`	~640 MB	effectively identical to f16 — recommended
`magos-k8s-0.6b-q4_k_m.gguf`	~400 MB	smallest; more field/flag mistakes — fine for casual use

from llama_cpp import Llama

llm = Llama(model_path="magos-k8s-0.6b-q8_0.gguf", n_ctx=4096, chat_format="chatml")
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content":
        "kubectl describe pod shows: Warning FailedScheduling 0/3 nodes are available: 3 Insufficient memory. Why?"}],
    temperature=0.0,
    repeat_penalty=1.0,
    max_tokens=512,
)
print(resp["choices"][0]["message"]["content"])

Hugging Face transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

tok   = AutoTokenizer.from_pretrained("clglavan/magos-k8s-0.6b")
model = AutoModelForCausalLM.from_pretrained("clglavan/magos-k8s-0.6b",
                                             dtype="bfloat16",
                                             device_map="auto")

messages = [{"role": "user", "content":
    "My pod is CrashLoopBackOff right after deploy. What's the likely cause and fix?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512,
                     do_sample=False, repetition_penalty=1.0)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Training


Base model	Qwen/Qwen3-0.6B
Method	Two stage: continued pre-training (CPT) → supervised fine-tuning (SFT). Both full-weight (no LoRA).
Stage 1 corpus	~8.5k document chunks: kubernetes.io docs + blog (~6.5k), Kubernetes API reference v1.34 (~1.9k), Prometheus alert runbooks (~106). Unchanged since v5.
Stage 1	LR 5e-6, cosine, 1 epoch (~6.5M tokens)
Stage 2 corpus (v16)	~108k synthetic Q&A pairs derived from the official documentation, all with a structured `<think>` reasoning block: event→YAML diagnostic matched BROKEN/HEALTHY pairs across ~80 K8s failure subcategories plus a `kubectl` command-reference track (~45 subcommands + flags). Every YAML field is validated against the v1.34 OpenAPI schema and every flag against the kubectl reference, so the teacher only phrases scenarios around verified facts.
Stage 2	LR 2e-5, cosine, 4 epochs, micro-batch 1 / grad-accum 32 (effective batch 32), seq len 2048, bf16
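
As a worked example of the Stage 2 batch arithmetic, the sketch below derives the effective batch size and an approximate optimizer step count from the figures in the table. It assumes a single device, exactly 108,000 examples (the card says ~108k), one example per sequence, and that the last partial batch of each epoch is kept. The function name `optimizer_steps` is illustrative.

```python
import math


def optimizer_steps(
    num_examples: int,
    micro_batch: int,
    grad_accum: int,
    epochs: int,
    num_devices: int = 1,
) -> tuple[int, int]:
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = micro_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective, steps = optimizer_steps(108_000, micro_batch=1, grad_accum=32, epochs=4)
print(effective, steps)  # 32 13500
```

Under these assumptions each epoch is 3,375 updates, so the full run is on the order of 13.5k optimizer steps.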

Files

model.safetensors — fine-tuned weights, HF format (bf16)
magos-k8s-0.6b-f16.gguf / -q8_0.gguf / -q4_k_m.gguf — GGUF quantizations
tokenizer.json, tokenizer_config.json, chat_template.jinja — Qwen3 tokenizer + ChatML template
config.json, generation_config.json — standard HF configs

Limitations and intended use

This is a small experimental model. Always verify any command, YAML, or behavioral claim against current Kubernetes documentation before running in production. Intended for learning, prototyping, and as a component in local devops agents — not as an authoritative source.
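
When the model is used inside an agent, one way to act on this advice is to auto-execute only read-only commands and route everything else to a human. The following is a minimal, conservative sketch of such a gate. The allow-list and the helper name `is_read_only_kubectl` are assumptions for illustration, and the check deliberately fails closed on anything it does not recognize.

```python
import shlex

# kubectl subcommands that only read cluster state (illustrative allow-list).
READ_ONLY_VERBS = {
    "get", "describe", "logs", "explain", "top",
    "events", "version", "api-resources", "api-versions",
}

# Characters that could chain or redirect commands if passed to a shell.
SHELL_METACHARACTERS = set(";|&$`<>\n")


def is_read_only_kubectl(command: str) -> bool:
    """Return True only for a single kubectl call with a read-only subcommand."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    # Require the subcommand directly after "kubectl"; leading global flags
    # are rejected rather than parsed, to keep the check simple and safe.
    return tokens[1] in READ_ONLY_VERBS
```

A command that passes the gate should still be run with an argument list (for example `subprocess.run(shlex.split(command))`) and not through a shell.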

License

Apache 2.0, inherited from the Qwen3-0.6B base model license. The training data is derived from the official Kubernetes documentation (CC-BY 4.0) and the prometheus-operator Prometheus runbooks (Apache 2.0).
