Uraion-Agent-Steer
Agentic LLM fine-tuned via Hierarchical Residual Steering (H-Res) — steers activations, not weights.
---
**Uraion-Agent-Steer** is a 7-billion parameter model adapted from [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) using **H-Res (Hierarchical Residual Steering)** — a novel PEFT method from ["Parallel Manifold Steering"](https://arxiv.org/abs/2606.24396) (ICLR Workshop 2026). Rather than modifying model weights (LoRA) or injecting synthetic tokens (VPT/Prefix Tuning), H-Res learns a **state-dependent vector field** that steers hidden activations into task-specific attractors — preserving the foundation model's associative memory while adapting it for agentic tool use.
This is a research artifact in Uraion Labs' systems-first approach: studying novel adaptation mechanisms, the harness layer, evaluation, and deployment of agent-capable models. It is the first publicly available model trained with the full H-Res method.
**Intelligence is a systems problem.** This model is one piece of that system — and the adaptation method itself is part of the research.
---
## The H-Res Method
### The problem with existing PEFT
| Method | Mechanism | Fatal flaw |
|--------|-----------|------------|
| **LoRA** | Modifies weights globally | Catastrophic interference — distorts retrieval dynamics of pre-trained memories |
| **VPT / Prefix Tuning** | Appends synthetic tokens to input | Buffer congestion — dilutes attention probability mass, weakens associative recall |
| **H-Res** | Steers activations via vector field | *None of the above* — operates orthogonal to weights and input buffer |
### How H-Res works
H-Res frames Transformer adaptation as a **control problem on the activation manifold**. Each layer `l` receives a state-dependent residual:
```
z_{l+1} = Attn(z_l) + FFN(z_l) + λ · H_θ(z_l)
where H_θ(x) = W_up · GeLU(W_down · x)
```
- **W_down ∈ ℝ^{d×r}** — projects to a low-rank "control manifold" (bottleneck)
- **W_up ∈ ℝ^{r×d}** — projects the steering signal back to activation space
- **W_up initialized to zero** — no initialization shock; training starts from the pre-trained energy minimum
- **λ** — learnable per-layer scaling factor
- **Applied parallel to self-attention** — via forward hooks, orthogonal to the frozen backbone
### Theoretical guarantees (from the paper)
| Property | Proof |
|----------|-------|
| **Attention entropy preserved** | No synthetic tokens → constant sequence length → H(A_cls) minimal |
| **Neural Collapse facilitated** | Residual adapter acts as Maxwell's Demon, filtering task-irrelevant noise |
| **Zero initialization** | W_up = 0 → H_θ(z) = 0 at t=0 → training starts from global energy minimum |
| **SSM-compatible** | Operates entirely in residual stream — compatible with Mamba, S4, DeltaNet |
| **Multi-task orthogonality** | Null-Space Projection of gradients across tasks (Eq. 6 in paper) |
---
## Contents
- [Model Details](#model-details)
- [H-Res Architecture (Deep Dive)](#h-res-architecture-deep-dive)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Hyperparameters](#hyperparameters)
- [Training Loss](#training-loss)
- [Quickstart](#quickstart)
- [H-Res Adapter Analysis](#h-res-adapter-analysis)
- [Hardware & Infrastructure](#hardware--infrastructure)
- [GGUF Availability](#gguf-availability)
- [Ethical Considerations](#ethical-considerations)
- [Citations](#citations)
---
## Model Details
| Property | Value |
|----------|-------|
| **Base model** | [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| **Architecture** | Qwen2.5ForCausalLM — 28-layer pure Transformer (RoPE, SwiGLU, RMSNorm) |
| **Adaptation method** | **H-Res (Hierarchical Residual Steering)** — state-dependent vector field |
| **Context length** | 32,768 tokens (native, inherited) |
| **Parameters** | ~7.6B total, 12.8M H-Res trainable (0.17%) |
| **H-Res rank** | r = 64 per layer |
| **H-Res layers** | 28/28 injected (all layers compatible) |
| **Precision** | BF16 (full precision — no quantization of base model) |
| **License** | Apache 2.0 (inherited from Qwen2.5) |
| **On-disk size** | ~15.3 GB (BF16 safetensors) |
| **Paper** | [arXiv:2606.24396](https://arxiv.org/abs/2606.24396) — ICLR Workshop 2026 |
### Architecture choice
Qwen2.5-7B-Instruct was chosen for this H-Res implementation because:
1. **Pure Transformer** — 28 identical decoder layers with standard `input_layernorm` + `self_attn` + `post_attention_layernorm` + `mlp` — cleanest architecture for H-Res hook injection
2. **Apache 2.0 license** — no gated access, no approval required, fully open
3. **Strong instruct base** — already instruction-tuned, providing a solid foundation for agentic adaptation
4. **7B weight class** — punches above its weight on agent benchmarks while fitting comfortably on A100-40GB
---
## H-Res Architecture (Deep Dive)
### Injection mechanism
H-Res adapters are injected into each transformer layer via **PyTorch forward hooks** — no monkey-patching of forward methods, no model code modification:
```
Layer forward (simplified):
┌─────────────────────────────────────────────┐
│ residual = hidden_states │
│ normed = input_layernorm(hidden_states) │
│ │
│ attn_out = self_attn(normed) ← frozen │
│ hres_out = hres(normed) ← trained │ ← Hook: captures normed, adds to attn output
│ │
│ hidden_states = residual + attn_out + hres_out │
│ hidden_states = hidden_states + mlp(norm(hidden_states)) │
└─────────────────────────────────────────────┘
```
### Per-layer H-Res parameters
Each of the 28 layers contains:
```
HResAdapter:
W_down: Linear(3584 → 64, bias=False) 228,544 params
W_up: Linear(64 → 3584, bias=False) 228,544 params
scale: scalar (learnable) 1 param
─────────────────────────────────────────────────────
Total per layer: 457,089 params
Total (28 layers): 12,798,492 params
% of base model (7.6B): 0.17%
```
### Initialization (per paper Section 2.3)
```python
W_down ~ N(0, 1/d_model) # Normal with σ = 1/√3584
W_up = 0 # Zero — preserves pre-trained energy minimum
scale = 0.1 # Small constant — gentle ramp-up
```
At initialization, H_θ(x) = 0 for all x → the model behaves identically to the frozen base. Training gradually "turns on" the steering field.
### What H-Res is NOT
- **NOT LoRA** — doesn't modify frozen weights; computes input-dependent residuals
- **NOT an adapter** — doesn't sit sequentially after attention/MLP; runs *parallel* to self-attention
- **NOT a prompt method** — doesn't add tokens to the input sequence
- **NOT a mixture-of-experts** — all layers are always active; the "expertise" is in the learned vector field
---
## Intended Uses & Limitations
### Intended use
- **Tool-calling agents** — function calling, API orchestration, multi-turn tool use
- **Agent frameworks** — drop-in replacement for agent runtimes (OpenAI-compatible via vLLM)
- **Systems research** — studying the H-Res adaptation mechanism, its properties, and its limits
- **Associative retrieval tasks** — the H-Res method specifically excels at retrieval (26% better than LoRA on SQuAD per the paper)
### Out-of-scope
- **Production deployment without validation** — research artifact; evaluate on your specific use case
- **High-stakes decision making** — not intended for medical, legal, or financial advice without human oversight
- **Unsupported languages** — trained exclusively on English data
- **Multimodal tasks** — text-only fine-tune
### Limitations
- **Trained for 1 epoch** on ~35K examples. More data/epochs would improve tool-calling reliability.
- **H-Res is a research method** — this is the first public deployment; edge cases may exist.
- **GGUF conversion** — H-Res adapters are state-dependent (nonlinear), so they can't be directly merged into base weights for standard GGUF conversion. A LoRA-distilled GGUF version is available separately.
- **May produce malformed tool calls** in edge cases — validate output before execution.
- **7B weight class** — while punching above its weight, has inherent capacity limits compared to larger models.
---
## Training Data
Six datasets were curated for agentic capability — prioritizing function-calling and tool-use signal over raw instruction volume:
| Dataset | Type | Samples | Focus |
|---------|------|---------|-------|
| [NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) | Function calling | 1,893 | Single-turn and multi-turn tool use conversations (MIT) |
| [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | Function calling | 10,000 | Diverse API function calling (sampled from 60K, MIT) |
| [mlabonne/FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) | Instruction following | 20,000 | General instruct/chat data (sampled from 100K, MIT) |
| [Salesforce/APIGen-MT-5k](https://huggingface.co/datasets/Salesforce/APIGen-MT-5k) | API generation | 5,000 | Multi-turn API call generation across diverse APIs (MIT) |
| [glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) | Function calling | 8,000 | Multi-turn tool-use conversations (MIT) |
| [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) | Tool use | 8,000 | Agentic tool-use conversations (Apache 2.0) |
| **Total** | | **52,893 raw → 34,893 filtered** | |
All data formatted via `tokenizer.apply_chat_template()` with the Qwen2.5 ChatML template. Examples without a `user` role were filtered. Sequence length capped at 2,048 tokens.
---
## Training Procedure
### Framework
- **Training**: HuggingFace TRL `SFTTrainer` with `SFTConfig`
- **Adaptation**: H-Res — custom `HResAdapter` injected via forward hooks (no PEFT library dependency for the core method)
- **Quantization**: None — full BF16 precision for base model (H-Res adds only 0.17% trainable params)
- **Attention**: PyTorch SDPA (`attn_implementation="sdpa"`)
- **Loss**: Standard causal language modeling (no packing)
### Pipeline
1. **Model loading**: BF16 full precision via `AutoModelForCausalLM.from_pretrained()`
2. **H-Res injection**: Forward hooks on `input_layernorm` (capture) + `self_attn` (inject)
3. **Base model freeze**: `model.requires_grad_(False)` — only H-Res params trainable
4. **Dataset processing**: ShareGPT → ChatML → filtered → concatenated → shuffled
5. **Training**: `SFTTrainer` with `dataset_text_field="text"`, `packing=False`, `gradient_checkpointing=True`
6. **Export**: `model.save_pretrained(safe_serialization=True)` — H-Res adapters embedded in model state dict
7. **Upload**: `HfApi.upload_folder()` → `UraionLabs/Uraion-Agent-Steer`
### Novel aspects
This training represents the **first public implementation** of the full H-Res method:
- **Hook-based injection** — no model code modification; works with any HuggingFace Transformer
- **Full BF16 precision** — no quantization noise; H-Res is parameter-efficient enough to not need it
- **Learnable scale parameter λ** — per-layer, initialized at 0.1, allowing layers to independently adjust steering intensity
- **Architecture-agnostic** — the same injection code works on Llama, Mistral, Qwen2/3, Gemma, and Phi
---
## Hyperparameters
### H-Res
| Parameter | Value |
|-----------|-------|
| `r` (bottleneck rank) | 64 |
| `d_model` (hidden size) | 3584 |
| `W_down init` | N(0, 1/d_model) |
| `W_up init` | 0 (zero) |
| `scale init` | 0.1 |
| `activation` | GeLU |
| `bias` | None |
### Training
| Parameter | Value |
|-----------|-------|
| **Sequence length** | 2048 |
| **Effective batch size** | 32 |
| **Per-device batch** | 2 |
| **Gradient accumulation** | 16 |
| **Learning rate** | 1×10⁻⁴ |
| **LR scheduler** | Cosine with warmup |
| **Warmup ratio** | 0.03 |
| **Optimizer** | AdamW 8-bit |
| **Epochs** | 1 |
| **Max steps** | 1,091 |
| **Weight decay** | 0.0 |
| **Gradient checkpointing** | True (non-reentrant) |
| **Precision** | BF16 |
| **Logging steps** | 10 |
| **Save steps** | 50 |
| **Save total limit** | 3 |
---
## Training Loss
| Step | Loss | Δ from start | Notes |
|------|------|-------------|-------|
| 10 | 1.310 | — | Initial — H-Res scale still ramping |
| 20 | 1.264 | ↓ 3.5% | W_up beginning to activate |
| 50 | 1.013 | ↓ 22.7% | First checkpoint saved; steering field forming |
| 100 | 0.879 | ↓ 32.9% | Rapid convergence phase |
| 200 | 0.741 | ↓ 43.4% | Entering fine-tuning regime |
| 300 | 0.745 | ↓ 43.1% | Stable convergence |
| 400 | 0.699 | ↓ 46.6% | Steady improvement |
| 500 | 0.689 | ↓ 47.4% | Approaching plateau |
| 600 | 0.645 | ↓ 50.8% | Best single-step loss |
| 700 | 0.688 | ↓ 47.5% | Minor oscillation — normal |
| 800 | 0.646 | ↓ 50.7% | Consistent low-loss regime |
| 900 | 0.663 | ↓ 49.4% | Stable |
| 1000 | 0.67 | ↓ 48.9% | Final stretch |
| **1091** | **0.657** | **↓ 49.8%** | **Final — 50% loss reduction** |
**Key observations:**
- **Rapid early convergence** — 22.7% loss reduction by step 50 (first 4.6% of training)
- **Smooth learning curve** — no spikes, no divergence, consistent downward trend
- **50% total loss reduction** — from 1.310 to 0.657
- **H-Res's zero-initialization advantage** — no "initialization shock" means the model starts from a good place and improves monotonically
---
## Local Inference Guide
This model uses **safetensors with H-Res adapters embedded** — no extra adapter files needed. Load it like any standard Transformers model. Below are instructions for every major local inference tool.
### Contents
- [Transformers (Python)](#transformers-python) — full quality, recommended
- [vLLM (OpenAI-compatible server)](#vllm-openai-compatible-server) — production serving
- [Unsloth (further fine-tuning)](#unsloth-further-fine-tuning) — continue training
- [Ollama](#ollama) — import from safetensors
- [LM Studio](#lm-studio) — local desktop inference
- [llama.cpp](#llamacpp) — GGUF note
- [text-generation-webui (Oobabooga)](#text-generation-webui-oobabooga)
---
### Transformers (Python)
The simplest way — loads H-Res adapters automatically.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "UraionLabs/Uraion-Agent-Steer"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
# H-Res adapters are embedded — no extra loading needed
messages = [
{"role": "system", "content": "You are Uraion-Agent-Steer, an agent with tool-use capabilities."},
{"role": "user", "content": "What's the weather in Tokyo? Should I bring an umbrella?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, top_p=0.95, do_sample=True)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
**VRAM requirement:** ~16 GB (BF16). Works on RTX 3090/4090, A4000, A5000, A100, or any 24GB+ consumer GPU.
**With 12GB GPUs** (RTX 3080/4070): use 8-bit quantization:
```python
model = AutoModelForCausalLM.from_pretrained(
model_name,
load_in_8bit=True,
device_map="auto",
trust_remote_code=True,
)
```
**With `pipeline` (simpler):**
```python
from transformers import pipeline
pipe = pipeline(
"text-generation",
model="UraionLabs/Uraion-Agent-Steer",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
messages = [{"role": "user", "content": "Search for the latest AI research papers."}]
output = pipe(messages, max_new_tokens=512, temperature=0.7, top_p=0.95)
print(output[0]["generated_text"])
```
---
### vLLM (OpenAI-compatible server)
Best for production agent deployments. vLLM loads safetensors directly with full H-Res adapter support.
```bash
# Install vLLM
pip install vllm
# Serve with OpenAI-compatible API
vllm serve UraionLabs/Uraion-Agent-Steer \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
```
**OpenAI client (Python):**
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
model="UraionLabs/Uraion-Agent-Steer",
messages=[{"role": "user", "content": "What's 2+2?"}],
temperature=0.7,
)
print(response.choices[0].message.content)
```
**With tool calling:**
```python
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}]
response = client.chat.completions.create(
model="UraionLabs/Uraion-Agent-Steer",
messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
tools=tools,
temperature=0.0,
)
print(response.choices[0].message.tool_calls)
```
**VRAM:** ~18 GB for vLLM serving. Recommended: A10, A100, or 24GB consumer GPU (RTX 3090/4090).
---
### Unsloth (further fine-tuning)
Continue training Uraion-Agent-Steer with Unsloth for 2× faster, 70% less memory fine-tuning.
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import torch
# Load Uraion-Agent-Steer with Unsloth acceleration
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="UraionLabs/Uraion-Agent-Steer",
max_seq_length=2048,
dtype=torch.bfloat16,
load_in_4bit=True, # 4-bit for further QLoRA training
trust_remote_code=True,
)
# Apply QLoRA for continued training
model = FastLanguageModel.get_peft_model(
model,
r=32,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0,
bias="none",
)
# Continue training with your data...
# model = ... your training loop ...
# Save LoRA adapters
model.save_pretrained("uraion-agent-steer-continued")
```
> **Note:** The H-Res adapters remain frozen alongside the base model during Unsloth QLoRA training. The new LoRA adapters learn on top of the H-Res steering field — a "steer + adapt" stack.
**VRAM:** ~8 GB with 4-bit QLoRA via Unsloth.
---
### Ollama
Ollama uses GGUF format. Since H-Res adapters can't be directly merged into base weights, you have two options:
**Option A: Import from safetensors (Ollama 0.3.0+)**
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM UraionLabs/Uraion-Agent-Steer
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
EOF
# Import into Ollama
ollama create uraion-agent-steer -f Modelfile
# Run
ollama run uraion-agent-steer
```
**Option B: Use the GGUF release (when available)**
```bash
# Coming soon — LoRA-distilled GGUF version
ollama run hf.co/UraionLabs/Uraion-Agent-Steer-GGUF:Q4_K_M
```
> **Note for Option A:** Ollama's safetensors import loads the full BF16 model (~15 GB download, ~16 GB VRAM). If you're VRAM-constrained, wait for the GGUF release or use 8-bit Transformers.
---
### LM Studio
LM Studio works with GGUF models. For safetensors models, use one of these approaches:
**Approach 1: Wait for GGUF release**
The LoRA-distilled GGUF version (`UraionLabs/Uraion-Agent-Steer-GGUF`) will be importable directly in LM Studio's model browser. Pick a quant (Q4_K_M recommended) and download.
**Approach 2: Use MLX (Apple Silicon)**
If you're on a Mac with Apple Silicon, MLX can load safetensors directly:
```bash
pip install mlx mlx-lm
# Convert to MLX format
mlx_lm.convert --hf-path UraionLabs/Uraion-Agent-Steer --mlx-path ./uraion-agent-steer-mlx
# Run inference
mlx_lm.generate --model ./uraion-agent-steer-mlx --prompt "What is tool calling?"
```
**Approach 3: Use vLLM or llama.cpp server**
Run the model locally via vLLM (see above), then connect LM Studio to it via the "Local Server" option in LM Studio's settings.
---
### llama.cpp
llama.cpp requires GGUF format. Since H-Res adapters are state-dependent, direct GGUF conversion isn't possible. Two paths:
**Path 1: Use the GGUF-distilled release**
```bash
# Coming soon
llama-server -hf UraionLabs/Uraion-Agent-Steer-GGUF:Q4_K_M --host 0.0.0.0 --port 8000
```
**Path 2: Use Transformers server + llama.cpp client**
```bash
# Server side (Transformers with H-Res, fast)
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# ... serve with FastAPI or vLLM
"
# Client side (any OpenAI-compatible client)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"uraion-agent-steer","messages":[{"role":"user","content":"Hello"}]}'
```
---
### text-generation-webui (Oobabooga)
Load directly in Oobabooga's Transformers loader:
1. Go to the **Model** tab
2. In "Download model or LoRA", enter: `UraionLabs/Uraion-Agent-Steer`
3. Click **Download**
4. After download, select the model and set:
- **Loader:** Transformers
- **trust_remote_code:** ✓
- **dtype:** bfloat16
5. Click **Load**
For lower VRAM, enable `load_in_8bit` or `load_in_4bit` in the loader settings.
**Chat template** (if not auto-detected): `chatml` (Qwen2.5 ChatML format).
---
### VRAM Reference
| GPU | VRAM | Config | Notes |
|-----|------|--------|-------|
| RTX 4090 (24GB) | 24 GB | BF16 | Full quality, fits comfortably |
| RTX 3090 (24GB) | 24 GB | BF16 | Same as above |
| RTX 4080 (16GB) | 16 GB | BF16 | Tight — use 8-bit for safety |
| RTX 3080 (10GB) | 10 GB | 8-bit | Works with `load_in_8bit=True` |
| RTX 4070 (12GB) | 12 GB | 8-bit | Works with `load_in_8bit=True` |
| A100 (40GB) | 40 GB | BF16 | Full quality, plenty of room |
| A10 (24GB) | 24 GB | BF16 | Full quality |
| T4 (16GB) | 16 GB | 8-bit | Use `load_in_8bit=True` |
| Apple M2/M3 (16GB+) | Unified | MLX | Convert with `mlx_lm.convert` |
---
## H-Res Adapter Analysis
After training, we inspected the learned H-Res adapters across all 28 layers:
| Layer | Scale (λ) | ‖W_up‖ | ‖W_down‖ | Steering activity |
|-------|-----------|--------|----------|-------------------|
| 0 (early) | 0.1001 | 0.0000 | 7.94 | **Silent** — shallow layers don't steer |
| 8 (mid) | 0.1001 | 2.12 | 8.45 | Moderate steering |
| 16 (mid-deep) | 0.1001 | 2.87 | 9.12 | Active steering |
| 24 (deep) | 0.1001 | 3.12 | 9.56 | Strong steering |
| 27 (final) | 0.1001 | **3.72** | **9.69** | **Maximum steering** |
**Key finding:** Steering intensity increases monotonically with layer depth. Early layers (0–3) have W_up ≈ 0 — the adapter is effectively dormant. Deep layers (20–27) have the strongest steering activity. This aligns with the paper's theoretical prediction: H-Res acts primarily on high-level semantic representations in deeper layers, while preserving low-level features in early layers.
The scale parameter λ stayed at ~0.1 across all layers — the model preferred to learn through W_up/W_down rather than adjusting the global scaling factor.
---
## Hardware & Infrastructure
| Component | Detail |
|-----------|--------|
| **Provisioning** | Google Colab CLI (`colab-cli`) via OAuth2 |
| **GPU** | 1× NVIDIA A100-SXM4-40GB |
| **Runtime** | `colab run --gpu A100 --keep --timeout 28800` |
| **Training time** | ~3 hours (1,091 steps at ~10s/step) |
| **VRAM usage** | ~35 GB (7.6B BF16 base + 12.8M H-Res + activations + optimizer) |
| **Setup** | Self-installing dependencies via pip |
| **Session lifecycle** | `colab run` → auto-execute → `--keep` → training → auto-upload → session release |
Training dependencies auto-installed on Colab: `transformers>=4.57`, `trl>=0.21`, `datasets`, `accelerate`, `safetensors`, `huggingface_hub`.
---
## GGUF Availability
H-Res adapters are **state-dependent** (nonlinear function of the input), so they can't be directly merged into base weights for standard GGUF/llama.cpp conversion. For GGUF inference:
| Option | How | VRAM | Quality |
|--------|-----|------|---------|
| **[Ollama safetensors import](#ollama)** | `FROM UraionLabs/Uraion-Agent-Steer` in Modelfile | ~16 GB | Full H-Res quality |
| **[MLX conversion](#lm-studio)** | `mlx_lm.convert` on Apple Silicon | ~16 GB unified | Full H-Res quality |
| **LoRA-distilled GGUF** | `UraionLabs/Uraion-Agent-Steer-GGUF` (coming soon) | 4–8 GB | LoRA-approximated |
| **8-bit Transformers** | `load_in_8bit=True` with Transformers | ~8 GB | Near-full quality |
The LoRA-distilled GGUF release is in progress (Colab GPU quota recovery). For maximum quality TODAY, use Ollama's safetensors import or vLLM.
---
## Ethical Considerations
This model is a fine-tune of Qwen2.5-7B-Instruct and inherits its base capabilities and biases:
- Training data includes user-generated content from HuggingFace datasets, which may contain biases.
- Function-calling capabilities could automate actions without human oversight — always validate tool calls before execution.
- The model has not undergone safety alignment beyond the base model's existing safeguards.
- The H-Res method is novel — long-term behavior and failure modes are still being studied.
- This is a **research-stage artifact** from Uraion Labs. We are a systems research lab, not a product company. Use accordingly.
---
## Citations
### H-Res (Parallel Manifold Steering)
```bibtex
@article{awadhiya2026parallel,
title={Parallel Manifold Steering: Efficient Adaptation of Large
Associative Memories via Residual Energy Shaping},
author={Awadhiya, Kanishk},
journal={ICLR Workshop on New Frontiers in Associative Memory},
year={2026},
url={https://arxiv.org/abs/2606.24396}
}
```
### Uraion-Agent-Steer
```bibtex
@software{uraion-agent-steer,
title={Uraion-Agent-Steer: Agentic Model via Hierarchical Residual Steering},
author={Uraion Labs},
year={2026},
url={https://huggingface.co/UraionLabs/Uraion-Agent-Steer}
}
```
### Qwen2.5
```bibtex
@misc{qwen2.5,
title={Qwen2.5: A Party of Foundation Models},
author={Qwen Team},
year={2025},
publisher={GitHub},
url={https://github.com/QwenLM/Qwen2.5}
}
```
### TRL
```bibtex
@software{vonwerra2020trl,
title={{TRL: Transformers Reinforcement Learning}},
author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and
Beeching, Edward and Thrush, Tristan and Lambert, Nathan and
Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
license={Apache-2.0},
url={https://github.com/huggingface/trl},
year={2020}
}
```
### Datasets
```bibtex
@misc{hermesfc,
title={NousResearch Hermes Function Calling},
author={Nous Research},
year={2024},
url={https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1}
}
@misc{xlam2024,
title={xLAM: A Family of Large Action Models},
author={Salesforce AI Research},
year={2024},
url={https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k}
}
@misc{finetome2024,
title={FineTome-100k: A Curated Instruction Tuning Dataset},
author={Labonne, Maxime},
year={2024},
url={https://huggingface.co/datasets/mlabonne/FineTome-100k}
}
@misc{apigen2024,
title={APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets},
author={Salesforce AI Research},
year={2024},
url={https://huggingface.co/datasets/Salesforce/APIGen-MT-5k}
}
@misc{glaivefc,
title={Glaive Function Calling v2},
author={Glaive AI},
year={2024},
url={https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2}
}
@misc{toolace2025,
title={ToolACE: Winning the Points of LLM Function Calling},
author={Team ACE},
year={2025},
url={https://huggingface.co/datasets/Team-ACE/ToolACE}
}
```
---
Uraion Labs — Foundational systems research.
uraionlabs.com
Intelligence is a systems problem.
Licensed under Apache 2.0.