Instructions to use UraionLabs/Uraion-Agent-Steer with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use UraionLabs/Uraion-Agent-Steer with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="UraionLabs/Uraion-Agent-Steer")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("UraionLabs/Uraion-Agent-Steer")
model = AutoModelForCausalLM.from_pretrained("UraionLabs/Uraion-Agent-Steer")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use UraionLabs/Uraion-Agent-Steer with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="UraionLabs/Uraion-Agent-Steer",
	filename="Uraion-Agent-Steer-Q2_K.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

llama.cpp

How to use UraionLabs/Uraion-Agent-Steer with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf UraionLabs/Uraion-Agent-Steer:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf UraionLabs/Uraion-Agent-Steer:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf UraionLabs/Uraion-Agent-Steer:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf UraionLabs/Uraion-Agent-Steer:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf UraionLabs/Uraion-Agent-Steer:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf UraionLabs/Uraion-Agent-Steer:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf UraionLabs/Uraion-Agent-Steer:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf UraionLabs/Uraion-Agent-Steer:Q4_K_M

Use Docker

docker model run hf.co/UraionLabs/Uraion-Agent-Steer:Q4_K_M

vLLM

How to use UraionLabs/Uraion-Agent-Steer with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "UraionLabs/Uraion-Agent-Steer"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "UraionLabs/Uraion-Agent-Steer",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/UraionLabs/Uraion-Agent-Steer:Q4_K_M

SGLang

How to use UraionLabs/Uraion-Agent-Steer with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "UraionLabs/Uraion-Agent-Steer" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "UraionLabs/Uraion-Agent-Steer",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "UraionLabs/Uraion-Agent-Steer" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "UraionLabs/Uraion-Agent-Steer",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use UraionLabs/Uraion-Agent-Steer with Ollama:
```
ollama run hf.co/UraionLabs/Uraion-Agent-Steer:Q4_K_M
```

Unsloth Studio

How to use UraionLabs/Uraion-Agent-Steer with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for UraionLabs/Uraion-Agent-Steer to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for UraionLabs/Uraion-Agent-Steer to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for UraionLabs/Uraion-Agent-Steer to start chatting

Pi

How to use UraionLabs/Uraion-Agent-Steer with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf UraionLabs/Uraion-Agent-Steer:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "UraionLabs/Uraion-Agent-Steer:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use UraionLabs/Uraion-Agent-Steer with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf UraionLabs/Uraion-Agent-Steer:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default UraionLabs/Uraion-Agent-Steer:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use UraionLabs/Uraion-Agent-Steer with Docker Model Runner:
```
docker model run hf.co/UraionLabs/Uraion-Agent-Steer:Q4_K_M
```

Lemonade

How to use UraionLabs/Uraion-Agent-Steer with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull UraionLabs/Uraion-Agent-Steer:Q4_K_M

Run and chat with the model

lemonade run user.Uraion-Agent-Steer-Q4_K_M

List all available models

lemonade list

---
base_model: Qwen/Qwen2.5-7B-Instruct
base_model_relation: finetune
library_name: transformers
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- agent
- function-calling
- tool-use
- h-res
- manifold-steering
- peft
- uraion-labs
- uraion
- iclr-2026
- associative-memory
- hopfield
- neural-collapse
- qwen2.5
- sft
- trl
- hermes-function-calling
- apigen
- xlam
- toolace
datasets:
- NousResearch/hermes-function-calling-v1
- Salesforce/xlam-function-calling-60k
- mlabonne/FineTome-100k
- Salesforce/APIGen-MT-5k
- glaiveai/glaive-function-calling-v2
- Team-ACE/ToolACE
inference:
  parameters:
    temperature: 0.7
    top_p: 0.95
    max_new_tokens: 4096
---

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://uraionlabs.com/public/icons/icon-192.png">
    <img src="https://uraionlabs.com/public/icons/icon-192.png" alt="Uraion Labs" width="64" height="64">
  </picture>
</p>

<p align="center">
  <strong style="font-family: 'Instrument Serif', Georgia, serif; font-size: 2rem; color: #F7F4ED; letter-spacing: -0.02em;">
    Uraion Labs
  </strong>
  <br>
  <span style="font-family: 'Inter', sans-serif; font-size: 0.875rem; color: #8A8478;">Foundational systems research.</span>
</p>

<p align="center">
  <strong style="font-family: 'Inter', sans-serif; font-size: 1.15rem; color: #E45A1A;">
    Uraion-Agent-Steer
  </strong>
  <br>
  <span style="font-family: 'Inter', sans-serif; font-size: 0.875rem; color: #8A8478;">
    Agentic LLM fine-tuned via Hierarchical Residual Steering (H-Res) — steers activations, not weights.
  </span>
</p>

---

**Uraion-Agent-Steer** is a 7-billion parameter model adapted from [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) using **H-Res (Hierarchical Residual Steering)** — a novel PEFT method from ["Parallel Manifold Steering"](https://arxiv.org/abs/2606.24396) (ICLR Workshop 2026). Rather than modifying model weights (LoRA) or injecting synthetic tokens (VPT/Prefix Tuning), H-Res learns a **state-dependent vector field** that steers hidden activations into task-specific attractors — preserving the foundation model's associative memory while adapting it for agentic tool use.

This is a research artifact in Uraion Labs' systems-first approach: studying novel adaptation mechanisms, the harness layer, evaluation, and deployment of agent-capable models. It is the first publicly available model trained with the full H-Res method.

**Intelligence is a systems problem.** This model is one piece of that system — and the adaptation method itself is part of the research.

---

## The H-Res Method

### The problem with existing PEFT

| Method | Mechanism | Fatal flaw |
|--------|-----------|------------|
| **LoRA** | Modifies weights globally | Catastrophic interference — distorts retrieval dynamics of pre-trained memories |
| **VPT / Prefix Tuning** | Appends synthetic tokens to input | Buffer congestion — dilutes attention probability mass, weakens associative recall |
| **H-Res** | Steers activations via vector field | *None of the above* — operates orthogonal to weights and input buffer |

### How H-Res works

H-Res frames Transformer adaptation as a **control problem on the activation manifold**. Each layer `l` receives a state-dependent residual:

```
z_{l+1} = Attn(z_l) + FFN(z_l) + λ · H_θ(z_l)

where  H_θ(x) = W_up · GeLU(W_down · x)
```

- **W_down ∈ ℝ^{r×d}**: projects to a low-rank "control manifold" (bottleneck)
- **W_up ∈ ℝ^{d×r}**: projects the steering signal back to activation space
- **W_up initialized to zero**: no initialization shock; training starts from the pre-trained energy minimum
- **λ**: learnable per-layer scaling factor
- **Applied parallel to self-attention**: via forward hooks, orthogonal to the frozen backbone
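
A minimal PyTorch sketch of the formula above may help make the shapes concrete. It is not the authors' implementation: the class name `HResAdapter` mirrors the name used later in this card, while the attribute names (`W_down`, `W_up`, `scale`) and the use of `nn.Linear` are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class HResAdapter(nn.Module):
    """Sketch of lambda * H_theta(x), with H_theta(x) = W_up · GeLU(W_down · x)."""

    def __init__(self, d_model: int, rank: int, init_scale: float = 0.1):
        super().__init__()
        self.W_down = nn.Linear(d_model, rank, bias=False)   # d -> r bottleneck
        self.W_up = nn.Linear(rank, d_model, bias=False)     # r -> d, back to activation space
        self.scale = nn.Parameter(torch.tensor(init_scale))  # per-layer lambda
        nn.init.normal_(self.W_down.weight, mean=0.0, std=d_model ** -0.5)
        nn.init.zeros_(self.W_up.weight)  # zero init: the adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns the steering term only; the caller adds it to the attention output.
        return self.scale * self.W_up(nn.functional.gelu(self.W_down(x)))


adapter = HResAdapter(d_model=3584, rank=64)
x = torch.randn(2, 5, 3584)  # (batch, sequence, hidden)
steer = adapter(x)
print(steer.shape, steer.abs().max().item())
```

Because `W_up` starts at zero, the steering term is exactly zero for any input until training updates it.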

### Theoretical guarantees (from the paper)

| Property | Proof |
|----------|-------|
| **Attention entropy preserved** | No synthetic tokens → constant sequence length → H(A_cls) minimal |
| **Neural Collapse facilitated** | Residual adapter acts as Maxwell's Demon, filtering task-irrelevant noise |
| **Zero initialization** | W_up = 0 → H_θ(z) = 0 at t=0 → training starts from global energy minimum |
| **SSM-compatible** | Operates entirely in residual stream — compatible with Mamba, S4, DeltaNet |
| **Multi-task orthogonality** | Null-Space Projection of gradients across tasks (Eq. 6 in paper) |

---

## Contents

- [Model Details](#model-details)
- [H-Res Architecture (Deep Dive)](#h-res-architecture-deep-dive)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Hyperparameters](#hyperparameters)
- [Training Loss](#training-loss)
- [Local Inference Guide](#local-inference-guide)
- [H-Res Adapter Analysis](#h-res-adapter-analysis)
- [Hardware & Infrastructure](#hardware--infrastructure)
- [GGUF Availability](#gguf-availability)
- [Ethical Considerations](#ethical-considerations)
- [Citations](#citations)

---

## Model Details

| Property | Value |
|----------|-------|
| **Base model** | [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| **Architecture** | `Qwen2ForCausalLM`: 28-layer pure Transformer (RoPE, SwiGLU, RMSNorm) |
| **Adaptation method** | **H-Res (Hierarchical Residual Steering)** — state-dependent vector field |
| **Context length** | 32,768 tokens (native, inherited) |
| **Parameters** | ~7.6B total, 12.8M H-Res trainable (0.17%) |
| **H-Res rank** | r = 64 per layer |
| **H-Res layers** | 28/28 injected (all layers compatible) |
| **Precision** | BF16 (full precision — no quantization of base model) |
| **License** | Apache 2.0 (inherited from Qwen2.5) |
| **On-disk size** | ~15.3 GB (BF16 safetensors) |
| **Paper** | [arXiv:2606.24396](https://arxiv.org/abs/2606.24396) — ICLR Workshop 2026 |

### Architecture choice

Qwen2.5-7B-Instruct was chosen for this H-Res implementation because:

1. **Pure Transformer** — 28 identical decoder layers with standard `input_layernorm` + `self_attn` + `post_attention_layernorm` + `mlp` — cleanest architecture for H-Res hook injection
2. **Apache 2.0 license** — no gated access, no approval required, fully open
3. **Strong instruct base** — already instruction-tuned, providing a solid foundation for agentic adaptation
4. **7B weight class** — punches above its weight on agent benchmarks while fitting comfortably on A100-40GB

---

## H-Res Architecture (Deep Dive)

### Injection mechanism

H-Res adapters are injected into each transformer layer via **PyTorch forward hooks** — no monkey-patching of forward methods, no model code modification:

```
Layer forward (simplified):
  ┌─────────────────────────────────────────────┐
  │ residual = hidden_states                     │
  │ normed = input_layernorm(hidden_states)      │
  │                                              │
  │ attn_out = self_attn(normed)     ← frozen   │
  │ hres_out = hres(normed)          ← trained  │  ← Hook: captures normed, adds to attn output
  │                                              │
  │ hidden_states = residual + attn_out + hres_out │
  │ hidden_states = hidden_states + mlp(norm(hidden_states)) │
  └─────────────────────────────────────────────┘
```
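
The capture-then-add pattern in the diagram can be sketched with two standard PyTorch forward hooks. This is an illustrative toy, not the released injection code: `ToyLayer`, `ToyAttention`, and `inject_steering` are hypothetical names, and the toy attention simply returns a tuple the way many Hugging Face attention modules do.

```python
import torch
import torch.nn as nn


class ToyAttention(nn.Module):
    """Stand-in for a frozen self-attention block; returns a tuple."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, hidden_states):
        return (self.proj(hidden_states), None)


class ToyLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(d_model)
        self.self_attn = ToyAttention(d_model)

    def forward(self, hidden_states):
        normed = self.input_layernorm(hidden_states)
        return hidden_states + self.self_attn(normed)[0]


def inject_steering(layer: nn.Module, steer: nn.Module):
    """Capture the normed input, then add steer(normed) to the attention output."""
    cache = {}

    def capture(module, args, output):
        cache["normed"] = output  # output of input_layernorm

    def add_steering(module, args, output):
        delta = steer(cache.pop("normed"))
        if isinstance(output, tuple):  # keep any extra outputs untouched
            return (output[0] + delta,) + tuple(output[1:])
        return output + delta

    return [
        layer.input_layernorm.register_forward_hook(capture),
        layer.self_attn.register_forward_hook(add_steering),
    ]


torch.manual_seed(0)
d_model, rank = 16, 4
layer = ToyLayer(d_model)
steer = nn.Sequential(nn.Linear(d_model, rank, bias=False), nn.GELU(),
                      nn.Linear(rank, d_model, bias=False))
nn.init.zeros_(steer[2].weight)  # zero-initialized up-projection

x = torch.randn(1, 3, d_model)
baseline = layer(x)
handles = inject_steering(layer, steer)
same_at_init = torch.allclose(layer(x), baseline)  # zero init leaves output unchanged

nn.init.normal_(steer[2].weight, std=0.1)  # pretend training moved W_up
changed_after = not torch.allclose(layer(x), baseline)

for h in handles:  # removing the hooks restores the base layer
    h.remove()
restored = torch.allclose(layer(x), baseline)
print(same_at_init, changed_after, restored)
```

Returning a value from a forward hook replaces the module's output, which is what lets the steering term be added without editing the layer's `forward` method.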

### Per-layer H-Res parameters

Each of the 28 layers contains:

```
HResAdapter:
  W_down: Linear(3584 → 64, bias=False)   229,376 params
  W_up:   Linear(64 → 3584, bias=False)   229,376 params
  scale:  scalar (learnable)                    1 param
  ─────────────────────────────────────────────────────
  Total per layer:                        458,753 params
  Total (28 layers):                   12,845,084 params
  % of base model (7.6B):                    0.17%
```
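
The counts follow directly from the layer shapes, so they can be checked with a few lines of arithmetic. The 7.6B base size below is the approximate figure quoted in this card, not an exact parameter count.

```python
d_model, rank, num_layers = 3584, 64, 28

w_down = d_model * rank        # Linear(d_model -> rank), no bias
w_up = rank * d_model          # Linear(rank -> d_model), no bias
per_layer = w_down + w_up + 1  # plus one learnable scale
total = per_layer * num_layers

base_params = 7.6e9            # approximate size of the base model
print(f"per layer: {per_layer:,}  total: {total:,}  share: {total / base_params:.2%}")
```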

### Initialization (per paper Section 2.3)

```python
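import torch

d_model, r = 3584, 64
W_down = torch.randn(r, d_model) * d_model ** -0.5   # N(0, 1/d_model): normal with σ = 1/√3584
W_up = torch.zeros(d_model, r)                        # zero: preserves the pre-trained energy minimum
scale = torch.tensor(0.1)                             # small constant for a gentle ramp-up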
```

At initialization, H_θ(x) = 0 for all x → the model behaves identically to the frozen base. Training gradually "turns on" the steering field.

### What H-Res is NOT

- **NOT LoRA**: doesn't modify frozen weights; computes input-dependent residuals
- **NOT a sequential adapter**: doesn't sit in series after attention/MLP; the H-Res module runs *parallel* to self-attention
- **NOT a prompt method**: doesn't add tokens to the input sequence
- **NOT a mixture-of-experts**: all layers are always active; the "expertise" is in the learned vector field

---

## Intended Uses & Limitations

### Intended use

- **Tool-calling agents** — function calling, API orchestration, multi-turn tool use
- **Agent frameworks** — drop-in replacement for agent runtimes (OpenAI-compatible via vLLM)
- **Systems research** — studying the H-Res adaptation mechanism, its properties, and its limits
- **Associative retrieval tasks** — the H-Res method specifically excels at retrieval (26% better than LoRA on SQuAD per the paper)

### Out-of-scope

- **Production deployment without validation** — research artifact; evaluate on your specific use case
- **High-stakes decision making** — not intended for medical, legal, or financial advice without human oversight
- **Unsupported languages** — trained exclusively on English data
- **Multimodal tasks** — text-only fine-tune

### Limitations

- **Trained for 1 epoch** on ~35K examples. More data/epochs would improve tool-calling reliability.
- **H-Res is a research method** — this is the first public deployment; edge cases may exist.
- **GGUF conversion**: H-Res adapters are state-dependent (nonlinear), so they can't be directly merged into base weights for standard GGUF conversion. A LoRA-distilled GGUF version is planned as a separate release.
- **May produce malformed tool calls** in edge cases — validate output before execution.
- **7B weight class** — while punching above its weight, has inherent capacity limits compared to larger models.
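
Since malformed tool calls are possible, one way to validate output before execution is to check each call against the declared tool schemas. The sketch below is an assumption about how a harness might do this, not part of the model or any library; `validate_tool_call` is a hypothetical helper, and it checks only names and required or unexpected keys, not argument types.

```python
import json


def validate_tool_call(name: str, arguments: str, tools: list[dict]) -> tuple[bool, str]:
    """Check a model-emitted call against the declared tool schemas before running it."""
    schemas = {t["function"]["name"]: t["function"] for t in tools}
    if name not in schemas:
        return False, f"unknown tool: {name}"
    try:
        args = json.loads(arguments)
    except json.JSONDecodeError as exc:
        return False, f"arguments are not valid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"
    params = schemas[name].get("parameters", {})
    missing = [k for k in params.get("required", []) if k not in args]
    if missing:
        return False, f"missing required arguments: {missing}"
    unknown = [k for k in args if k not in params.get("properties", {})]
    if unknown:
        return False, f"unexpected arguments: {unknown}"
    return True, "ok"


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

print(validate_tool_call("get_weather", '{"location": "Tokyo"}', tools))
print(validate_tool_call("get_weather", '{"city": "Tokyo"}', tools))
print(validate_tool_call("get_weather", '{"location": "Tokyo"', tools))
```

A full JSON Schema validator would also catch type errors; this version only rejects the most common structural failures.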

---

## Training Data

Six datasets were curated for agentic capability — prioritizing function-calling and tool-use signal over raw instruction volume:

| Dataset | Type | Samples | Focus |
|---------|------|---------|-------|
| [NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) | Function calling | 1,893 | Single-turn and multi-turn tool use conversations (MIT) |
| [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | Function calling | 10,000 | Diverse API function calling (sampled from 60K, MIT) |
| [mlabonne/FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) | Instruction following | 20,000 | General instruct/chat data (sampled from 100K, MIT) |
| [Salesforce/APIGen-MT-5k](https://huggingface.co/datasets/Salesforce/APIGen-MT-5k) | API generation | 5,000 | Multi-turn API call generation across diverse APIs (MIT) |
| [glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) | Function calling | 8,000 | Multi-turn tool-use conversations (MIT) |
| [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) | Tool use | 8,000 | Agentic tool-use conversations (Apache 2.0) |
| **Total** | | **52,893 raw → 34,893 filtered** | |

All data formatted via `tokenizer.apply_chat_template()` with the Qwen2.5 ChatML template. Examples without a `user` role were filtered. Sequence length capped at 2,048 tokens.
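
The role filter can be sketched in plain Python. This is a hedged illustration rather than the actual preprocessing script: the `conversations` / `from` / `value` field names follow the common ShareGPT layout and are an assumption about the source records, and `sharegpt_to_messages` and `has_user_turn` are hypothetical helper names.

```python
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


def sharegpt_to_messages(example: dict) -> list[dict]:
    """Convert a ShareGPT-style record to a list of {"role", "content"} messages."""
    messages = []
    for turn in example.get("conversations", []):
        role = ROLE_MAP.get(turn.get("from"))
        if role is not None:  # skip turns with unknown speakers
            messages.append({"role": role, "content": turn.get("value", "")})
    return messages


def has_user_turn(messages: list[dict]) -> bool:
    """Keep only examples that contain at least one `user` message."""
    return any(m["role"] == "user" for m in messages)


raw = [
    {"conversations": [{"from": "human", "value": "Weather in Tokyo?"},
                       {"from": "gpt", "value": "Let me check."}]},
    {"conversations": [{"from": "system", "value": "You are a tool."},
                       {"from": "gpt", "value": "Ready."}]},
]
converted = [sharegpt_to_messages(ex) for ex in raw]
kept = [m for m in converted if has_user_turn(m)]
print(len(raw), "->", len(kept))
```

Each surviving message list would then be rendered with `tokenizer.apply_chat_template(messages, tokenize=False)` and capped at 2,048 tokens.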

---

## Training Procedure

### Framework

- **Training**: HuggingFace TRL `SFTTrainer` with `SFTConfig`
- **Adaptation**: H-Res — custom `HResAdapter` injected via forward hooks (no PEFT library dependency for the core method)
- **Quantization**: None — full BF16 precision for base model (H-Res adds only 0.17% trainable params)
- **Attention**: PyTorch SDPA (`attn_implementation="sdpa"`)
- **Loss**: Standard causal language modeling (no packing)

### Pipeline

1. **Model loading**: BF16 full precision via `AutoModelForCausalLM.from_pretrained()`
2. **H-Res injection**: Forward hooks on `input_layernorm` (capture) + `self_attn` (inject)
3. **Base model freeze**: `model.requires_grad_(False)` — only H-Res params trainable
4. **Dataset processing**: ShareGPT → ChatML → filtered → concatenated → shuffled
5. **Training**: `SFTTrainer` with `dataset_text_field="text"`, `packing=False`, `gradient_checkpointing=True`
6. **Export**: `model.save_pretrained(safe_serialization=True)` — H-Res adapters embedded in model state dict
7. **Upload**: `HfApi.upload_folder()` → `UraionLabs/Uraion-Agent-Steer`

### Novel aspects

This training represents the **first public implementation** of the full H-Res method:

- **Hook-based injection** — no model code modification; works with any HuggingFace Transformer
- **Full BF16 precision** — no quantization noise; H-Res is parameter-efficient enough to not need it
- **Learnable scale parameter λ** — per-layer, initialized at 0.1, allowing layers to independently adjust steering intensity
- **Architecture-agnostic** — the same injection code works on Llama, Mistral, Qwen2/3, Gemma, and Phi

---

## Hyperparameters

### H-Res

| Parameter | Value |
|-----------|-------|
| `r` (bottleneck rank) | 64 |
| `d_model` (hidden size) | 3584 |
| `W_down init` | N(0, 1/d_model) |
| `W_up init` | 0 (zero) |
| `scale init` | 0.1 |
| `activation` | GeLU |
| `bias` | None |

### Training

| Parameter | Value |
|-----------|-------|
| **Sequence length** | 2048 |
| **Effective batch size** | 32 |
| **Per-device batch** | 2 |
| **Gradient accumulation** | 16 |
| **Learning rate** | 1×10⁻⁴ |
| **LR scheduler** | Cosine with warmup |
| **Warmup ratio** | 0.03 |
| **Optimizer** | AdamW 8-bit |
| **Epochs** | 1 |
| **Max steps** | 1,091 |
| **Weight decay** | 0.0 |
| **Gradient checkpointing** | True (non-reentrant) |
| **Precision** | BF16 |
| **Logging steps** | 10 |
| **Save steps** | 50 |
| **Save total limit** | 3 |
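
The step count and learning-rate curve implied by this table can be reproduced with a short calculation. The sketch assumes a single GPU (as listed under Hardware & Infrastructure), warmup steps rounded up, and cosine decay to zero; those last two details are assumptions about the scheduler, not values stated in this card.

```python
import math

num_examples = 34_893
per_device_batch, grad_accum, num_gpus = 2, 16, 1
effective_batch = per_device_batch * grad_accum * num_gpus
total_steps = math.ceil(num_examples / effective_batch)

peak_lr, warmup_ratio = 1e-4, 0.03
warmup_steps = math.ceil(total_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```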

---

## Training Loss

| Step | Loss | Δ from start | Notes |
|------|------|-------------|-------|
| 10 | 1.310 | — | Initial — H-Res scale still ramping |
| 20 | 1.264 | ↓ 3.5% | W_up beginning to activate |
| 50 | 1.013 | ↓ 22.7% | First checkpoint saved; steering field forming |
| 100 | 0.879 | ↓ 32.9% | Rapid convergence phase |
| 200 | 0.741 | ↓ 43.4% | Entering fine-tuning regime |
| 300 | 0.745 | ↓ 43.1% | Stable convergence |
| 400 | 0.699 | ↓ 46.6% | Steady improvement |
| 500 | 0.689 | ↓ 47.4% | Approaching plateau |
| 600 | 0.645 | ↓ 50.8% | Lowest logged loss |
| 700 | 0.688 | ↓ 47.5% | Minor oscillation — normal |
| 800 | 0.646 | ↓ 50.7% | Consistent low-loss regime |
| 900 | 0.663 | ↓ 49.4% | Stable |
| 1000 | 0.670 | ↓ 48.9% | Final stretch |
| **1091** | **0.657** | **↓ 49.8%** | **Final — 50% loss reduction** |

**Key observations:**
- **Rapid early convergence** — 22.7% loss reduction by step 50 (first 4.6% of training)
- **Smooth learning curve** — no spikes, no divergence, consistent downward trend
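- **~50% total loss reduction**: from 1.310 to 0.657 (49.8%)
- **H-Res's zero-initialization advantage**: no "initialization shock" means the model starts from a good place and improves steadily, with only minor oscillation between logged steps (for example 0.645 at step 600 and 0.688 at step 700)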
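
The "Δ from start" column is the relative drop from the first logged loss. A small sketch of that calculation, using a handful of rows from the table:

```python
logged = {10: 1.310, 50: 1.013, 100: 0.879, 600: 0.645, 1091: 0.657}
start = logged[10]


def reduction_pct(loss: float) -> float:
    """Percentage drop relative to the first logged loss."""
    return round(100 * (start - loss) / start, 1)


for step, loss in logged.items():
    print(f"step {step:>4}: loss {loss:.3f}  down {reduction_pct(loss):.1f}%")
```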

---

## Local Inference Guide

This model uses **safetensors with H-Res adapters embedded** — no extra adapter files needed. Load it like any standard Transformers model. Below are instructions for every major local inference tool.

### Contents
- [Transformers (Python)](#transformers-python) — full quality, recommended
- [vLLM (OpenAI-compatible server)](#vllm-openai-compatible-server) — production serving
- [Unsloth (further fine-tuning)](#unsloth-further-fine-tuning) — continue training
- [Ollama](#ollama) — import from safetensors
- [LM Studio](#lm-studio) — local desktop inference
- [llama.cpp](#llamacpp) — GGUF note
- [text-generation-webui (Oobabooga)](#text-generation-webui-oobabooga)

---

### Transformers (Python)

The simplest way — loads H-Res adapters automatically.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "UraionLabs/Uraion-Agent-Steer"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# H-Res adapters are embedded — no extra loading needed
messages = [
    {"role": "system", "content": "You are Uraion-Agent-Steer, an agent with tool-use capabilities."},
    {"role": "user", "content": "What's the weather in Tokyo? Should I bring an umbrella?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, top_p=0.95, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

**VRAM requirement:** ~16 GB (BF16). Fits on 24 GB+ GPUs such as the RTX 3090/4090, A5000, A10, or A100; 16 GB cards (for example the A4000) are tight, so prefer 8-bit there.

**With 10–12 GB GPUs** (RTX 3080/4070): use 8-bit quantization:

```python
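from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit loading via bitsandbytes (pip install bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    trust_remote_code=True,
)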
```

**With `pipeline` (simpler):**

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="UraionLabs/Uraion-Agent-Steer",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Search for the latest AI research papers."}]
output = pipe(messages, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.95)
# For chat input, generated_text holds the whole conversation; the last entry is the reply.
print(output[0]["generated_text"][-1]["content"])
```

---

### vLLM (OpenAI-compatible server)

Best for production agent deployments. vLLM loads safetensors directly with full H-Res adapter support.

```bash
# Install vLLM
pip install vllm

# Serve with OpenAI-compatible API
vllm serve UraionLabs/Uraion-Agent-Steer \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90
```

**OpenAI client (Python):**

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="UraionLabs/Uraion-Agent-Steer",
    messages=[{"role": "user", "content": "What's 2+2?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```

**With tool calling:**

```python
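# Reuses `client` from the example above. For vLLM to return parsed tool calls, the server
# typically needs --enable-auto-tool-choice and a --tool-call-parser (e.g. hermes).
tools = [{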
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="UraionLabs/Uraion-Agent-Steer",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    temperature=0.0,
)
print(response.choices[0].message.tool_calls)
```

**VRAM:** ~18 GB for vLLM serving. Recommended: A10, A100, or 24GB consumer GPU (RTX 3090/4090).

---

### Unsloth (further fine-tuning)

Continue training Uraion-Agent-Steer with Unsloth for 2× faster, 70% less memory fine-tuning.

```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import torch

# Load Uraion-Agent-Steer with Unsloth acceleration
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="UraionLabs/Uraion-Agent-Steer",
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=True,  # 4-bit for further QLoRA training
    trust_remote_code=True,
)

# Apply QLoRA for continued training
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
)

# Continue training with your data...
# model = ... your training loop ...

# Save LoRA adapters
model.save_pretrained("uraion-agent-steer-continued")
```

> **Note:** The H-Res adapters remain frozen alongside the base model during Unsloth QLoRA training. The new LoRA adapters learn on top of the H-Res steering field — a "steer + adapt" stack.

**VRAM:** ~8 GB with 4-bit QLoRA via Unsloth.

---

### Ollama

Ollama uses GGUF format. Since H-Res adapters can't be directly merged into base weights, you have two options:

**Option A: Import from safetensors (Ollama 0.3.0+)**

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM UraionLabs/Uraion-Agent-Steer
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
EOF

# Import into Ollama
ollama create uraion-agent-steer -f Modelfile

# Run
ollama run uraion-agent-steer
```

**Option B: Use the GGUF release (when available)**

```bash
# Coming soon — LoRA-distilled GGUF version
ollama run hf.co/UraionLabs/Uraion-Agent-Steer-GGUF:Q4_K_M
```

> **Note for Option A:** Ollama's safetensors import loads the full BF16 model (~15 GB download, ~16 GB VRAM). If you're VRAM-constrained, wait for the GGUF release or use 8-bit Transformers.

---

### LM Studio

LM Studio works with GGUF models. For safetensors models, use one of these approaches:

**Approach 1: Wait for GGUF release**

The LoRA-distilled GGUF version (`UraionLabs/Uraion-Agent-Steer-GGUF`) will be importable directly in LM Studio's model browser. Pick a quant (Q4_K_M recommended) and download.

**Approach 2: Use MLX (Apple Silicon)**

If you're on a Mac with Apple Silicon, MLX can load safetensors directly:

```bash
pip install mlx mlx-lm

# Convert to MLX format
mlx_lm.convert --hf-path UraionLabs/Uraion-Agent-Steer --mlx-path ./uraion-agent-steer-mlx

# Run inference
mlx_lm.generate --model ./uraion-agent-steer-mlx --prompt "What is tool calling?"
```

**Approach 3: Use vLLM or llama.cpp server**

Run the model locally via vLLM (see above), then connect LM Studio to it via the "Local Server" option in LM Studio's settings.

---

### llama.cpp

llama.cpp requires GGUF format. Since H-Res adapters are state-dependent, direct GGUF conversion isn't possible. Two paths:

**Path 1: Use the GGUF-distilled release**

```bash
# Coming soon
llama-server -hf UraionLabs/Uraion-Agent-Steer-GGUF:Q4_K_M --host 0.0.0.0 --port 8000
```

**Path 2: Serve the safetensors model and call it from any OpenAI-compatible client**

```bash
# Server side: serve the safetensors weights with vLLM (see the vLLM section above)
vllm serve UraionLabs/Uraion-Agent-Steer \
    --trust-remote-code \
    --port 8000 \
    --served-model-name uraion-agent-steer

# Client side (any OpenAI-compatible client)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"uraion-agent-steer","messages":[{"role":"user","content":"Hello"}]}'
```

---

### text-generation-webui (Oobabooga)

Load directly in Oobabooga's Transformers loader:

1. Go to the **Model** tab
2. In "Download model or LoRA", enter: `UraionLabs/Uraion-Agent-Steer`
3. Click **Download**
4. After download, select the model and set:
   - **Loader:** Transformers
   - **trust_remote_code:** ✓
   - **dtype:** bfloat16
5. Click **Load**

For lower VRAM, enable `load_in_8bit` or `load_in_4bit` in the loader settings.

**Chat template** (if not auto-detected): `chatml` (Qwen2.5 ChatML format).

---

### VRAM Reference

| GPU | VRAM | Config | Notes |
|-----|------|--------|-------|
| RTX 4090 (24GB) | 24 GB | BF16 | Full quality, fits comfortably |
| RTX 3090 (24GB) | 24 GB | BF16 | Same as above |
| RTX 4080 (16GB) | 16 GB | BF16 | Tight — use 8-bit for safety |
| RTX 3080 (10GB) | 10 GB | 8-bit | Works with `load_in_8bit=True` |
| RTX 4070 (12GB) | 12 GB | 8-bit | Works with `load_in_8bit=True` |
| A100 (40GB) | 40 GB | BF16 | Full quality, plenty of room |
| A10 (24GB) | 24 GB | BF16 | Full quality |
| T4 (16GB) | 16 GB | 8-bit | Use `load_in_8bit=True` |
| Apple M2/M3 (16GB+) | Unified | MLX | Convert with `mlx_lm.convert` |
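
The table follows from a simple weights-only estimate: parameter count times bytes per parameter. The sketch below uses the approximate 7.6B figure from this card and ignores activations, KV cache, and framework overhead, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7.6e9  # approximate parameter count from the Model Details table
for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```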

---

## H-Res Adapter Analysis

After training, we inspected the learned H-Res adapters across all 28 layers:

| Layer | Scale (λ) | ‖W_up‖ | ‖W_down‖ | Steering activity |
|-------|-----------|--------|----------|-------------------|
| 0 (early) | 0.1001 | 0.0000 | 7.94 | **Silent** — shallow layers don't steer |
| 8 (mid) | 0.1001 | 2.12 | 8.45 | Moderate steering |
| 16 (mid-deep) | 0.1001 | 2.87 | 9.12 | Active steering |
| 24 (deep) | 0.1001 | 3.12 | 9.56 | Strong steering |
| 27 (final) | 0.1001 | **3.72** | **9.69** | **Maximum steering** |

**Key finding:** Steering intensity, measured by ‖W_up‖, increases monotonically with layer depth. Early layers (0–3) have ‖W_up‖ ≈ 0, so the adapter is effectively dormant there, while deep layers (20–27) show the strongest steering activity. This matches the paper's theoretical prediction: H-Res acts primarily on high-level semantic representations in deeper layers and preserves low-level features in early layers.

The scale parameter λ stayed at ~0.1 in every layer: training adjusted W_up and W_down rather than the global scaling factor.
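
The monotonic trend can be checked directly against the rows in the table. This small sketch uses only the five layers listed there, so it confirms the trend for that sample rather than for every one of the 28 layers.

```python
# (layer index, ||W_up||, ||W_down||) copied from the table above
ROWS = [
    (0, 0.0000, 7.94),
    (8, 2.12, 8.45),
    (16, 2.87, 9.12),
    (24, 3.12, 9.56),
    (27, 3.72, 9.69),
]


def strictly_increasing(values):
    """True if each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


w_up_increasing = strictly_increasing([row[1] for row in ROWS])
w_down_increasing = strictly_increasing([row[2] for row in ROWS])
print(w_up_increasing, w_down_increasing)
```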

---

## Hardware & Infrastructure

| Component | Detail |
|-----------|--------|
| **Provisioning** | Google Colab CLI (`colab-cli`) via OAuth2 |
| **GPU** | 1× NVIDIA A100-SXM4-40GB |
| **Runtime** | `colab run --gpu A100 --keep --timeout 28800` |
| **Training time** | ~3 hours (1,091 steps at ~10s/step) |
| **VRAM usage** | ~35 GB (7.6B BF16 base + 12.8M H-Res + activations + optimizer) |
| **Setup** | Self-installing dependencies via pip |
| **Session lifecycle** | `colab run` → auto-execute → `--keep` → training → auto-upload → session release |
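
The training-time figure and the session timeout can be sanity-checked with simple arithmetic. The sketch treats the ~10 s/step figure as exact, which it is not, and assumes the `--timeout` value is expressed in seconds.

```python
steps = 1_091
seconds_per_step = 10   # approximate figure from the table
timeout_s = 28_800      # assumption: --timeout is given in seconds (8 hours)

total_seconds = steps * seconds_per_step
total_hours = total_seconds / 3600
fits_in_timeout = total_seconds < timeout_s

print(f"{total_hours:.2f} h, fits in timeout: {fits_in_timeout}")
```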

Training dependencies auto-installed on Colab: `transformers>=4.57`, `trl>=0.21`, `datasets`, `accelerate`, `safetensors`, `huggingface_hub`.

---

## GGUF Availability

H-Res adapters are **state-dependent** (a nonlinear function of the input), so they cannot be merged directly into the base weights for standard GGUF/llama.cpp conversion. The current and planned options are:

| Option | How | VRAM | Quality |
|--------|-----|------|---------|
| **[Ollama safetensors import](#ollama)** | `FROM UraionLabs/Uraion-Agent-Steer` in Modelfile | ~16 GB | Full H-Res quality |
| **[MLX conversion](#lm-studio)** | `mlx_lm.convert` on Apple Silicon | ~16 GB unified | Full H-Res quality |
| **LoRA-distilled GGUF** | `UraionLabs/Uraion-Agent-Steer-GGUF` (coming soon) | 4–8 GB | LoRA-approximated |
| **8-bit Transformers** | `load_in_8bit=True` with Transformers | ~8 GB | Near-full quality |

The LoRA-distilled GGUF release is in progress and is waiting on Colab GPU quota to recover. For maximum quality today, use Ollama's safetensors import or vLLM.
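
The merge limitation described at the top of this section can be illustrated with a toy example. A linear, LoRA-style update folds into the base matrix exactly, while an update that passes through a nonlinearity does not correspond to any single matrix. The `tanh` adapter below is only a stand-in for "some nonlinear function of the input"; it is an assumption for illustration and not the actual H-Res formulation, which is defined in the paper. All matrices are made-up toy values.

```python
import math


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def close(u, v, tol=1e-12):
    return all(abs(a - b) < tol for a, b in zip(u, v))


W = [[1.0, 0.5], [-0.3, 2.0]]  # toy base weight
DOWN = [[0.2, -0.1]]           # toy down-projection (2 -> 1)
UP = [[0.7], [-0.4]]           # toy up-projection (1 -> 2)
SCALE = 0.1


def linear_adapter(x):
    """LoRA-style update: SCALE * UP @ DOWN @ x, linear in x."""
    return [SCALE * v for v in matvec(UP, matvec(DOWN, x))]


def nonlinear_adapter(x):
    """State-dependent update: tanh between the projections (assumed form)."""
    hidden = [math.tanh(v) for v in matvec(DOWN, x)]
    return [SCALE * v for v in matvec(UP, hidden)]


x, y = [1.0, 1.0], [3.0, -1.0]

# Linear case: fold SCALE * UP @ DOWN into W once, then drop the adapter.
delta = [[SCALE * v for v in row] for row in matmul(UP, DOWN)]
W_merged = [add(rw, rd) for rw, rd in zip(W, delta)]
linear_ok = close(matvec(W_merged, x), add(matvec(W, x), linear_adapter(x)))

# Nonlinear case: a merged matrix would have to satisfy f(x + y) = f(x) + f(y).
nonlinear_additive = close(
    nonlinear_adapter(add(x, y)),
    add(nonlinear_adapter(x), nonlinear_adapter(y)),
)

print("linear merge exact:", linear_ok)
print("nonlinear adapter additive:", nonlinear_additive)
```

The first check passes and the second fails, which is the reason a distilled approximation is needed before a standard GGUF conversion.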

---

## Ethical Considerations

This model is a fine-tune of Qwen2.5-7B-Instruct and inherits its base capabilities and biases:

- Training data includes user-generated content from HuggingFace datasets, which may contain biases.
- Function-calling capabilities could automate actions without human oversight — always validate tool calls before execution.
- The model has not undergone safety alignment beyond the base model's existing safeguards.
- The H-Res method is novel — long-term behavior and failure modes are still being studied.
- This is a **research-stage artifact** from Uraion Labs. We are a systems research lab, not a product company. Use accordingly.
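
The advice to validate tool calls before execution can be sketched as a small gate between the model output and the code that runs tools. The example assumes the model emits a JSON object with `name` and `arguments` keys. The tool names, the schema table, and `validate_tool_call` are all hypothetical and should be replaced with whatever your application actually exposes.

```python
import json

# Hypothetical allowlist: tool name -> required argument names and their types
ALLOWED_TOOLS = {
    "get_weather": {"city": str},
    "search_docs": {"query": str, "limit": int},
}


def validate_tool_call(raw):
    """Parse a model-emitted tool call and reject anything unexpected."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool call is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        raise ValueError(f"tool {name!r} is not on the allowlist")

    schema = ALLOWED_TOOLS[name]
    if not isinstance(args, dict) or set(args) != set(schema):
        raise ValueError(f"arguments for {name!r} must be exactly {sorted(schema)}")
    for key, expected in schema.items():
        # Exact type match, so that True/False are not accepted as integers
        if type(args[key]) is not expected:
            raise ValueError(f"argument {key!r} must be of type {expected.__name__}")

    return {"name": name, "arguments": args}
```

Only calls that pass this gate should reach the executor. For actions with side effects, a human confirmation step on top of the schema check is a reasonable extra safeguard.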

---

## Citations

### H-Res (Parallel Manifold Steering)

```bibtex
@article{awadhiya2026parallel,
  title={Parallel Manifold Steering: Efficient Adaptation of Large
         Associative Memories via Residual Energy Shaping},
  author={Awadhiya, Kanishk},
  journal={ICLR Workshop on New Frontiers in Associative Memory},
  year={2026},
  url={https://arxiv.org/abs/2606.24396}
}
```

### Uraion-Agent-Steer

```bibtex
@software{uraion-agent-steer,
  title={Uraion-Agent-Steer: Agentic Model via Hierarchical Residual Steering},
  author={Uraion Labs},
  year={2026},
  url={https://huggingface.co/UraionLabs/Uraion-Agent-Steer}
}
```

### Qwen2.5

```bibtex
@misc{qwen2.5,
  title={Qwen2.5: A Party of Foundation Models},
  author={Qwen Team},
  year={2024},
  publisher={GitHub},
  url={https://github.com/QwenLM/Qwen2.5}
}
```

### TRL

```bibtex
@software{vonwerra2020trl,
  title={{TRL: Transformers Reinforcement Learning}},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and
          Beeching, Edward and Thrush, Tristan and Lambert, Nathan and
          Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license={Apache-2.0},
  url={https://github.com/huggingface/trl},
  year={2020}
}
```

### Datasets

```bibtex
@misc{hermesfc,
  title={NousResearch Hermes Function Calling},
  author={Nous Research},
  year={2024},
  url={https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1}
}

@misc{xlam2024,
  title={xLAM: A Family of Large Action Models},
  author={Salesforce AI Research},
  year={2024},
  url={https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k}
}

@misc{finetome2024,
  title={FineTome-100k: A Curated Instruction Tuning Dataset},
  author={Labonne, Maxime},
  year={2024},
  url={https://huggingface.co/datasets/mlabonne/FineTome-100k}
}

@misc{apigen2024,
  title={APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets},
  author={Salesforce AI Research},
  year={2024},
  url={https://huggingface.co/datasets/Salesforce/APIGen-MT-5k}
}

@misc{glaivefc,
  title={Glaive Function Calling v2},
  author={Glaive AI},
  year={2024},
  url={https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2}
}

@misc{toolace2025,
  title={ToolACE: Winning the Points of LLM Function Calling},
  author={Team ACE},
  year={2025},
  url={https://huggingface.co/datasets/Team-ACE/ToolACE}
}
```

---

<p align="center">
  <img src="https://uraionlabs.com/public/icons/icon-32.png" alt="" width="24" height="24">
</p>

<p align="center" style="font-family: 'Inter', sans-serif; font-size: 0.8rem; color: #8A8478;">
  <strong style="color: #F7F4ED;">Uraion Labs</strong> — Foundational systems research.
  <br>
  <a href="https://uraionlabs.com" style="color: #E45A1A;">uraionlabs.com</a>
  <br><br>
  <em style="color: #6F6A61;">
    Intelligence is a systems problem.
  </em>
  <br>
  Licensed under <a href="https://www.apache.org/licenses/LICENSE-2.0" style="color: #E45A1A;">Apache 2.0</a>.
</p>