---
base_model: Qwen/Qwen2.5-7B-Instruct
base_model_relation: finetune
library_name: transformers
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- agent
- function-calling
- tool-use
- h-res
- manifold-steering
- peft
- uraion-labs
- uraion
- iclr-2026
- associative-memory
- hopfield
- neural-collapse
- qwen2.5
- sft
- trl
- hermes-function-calling
- apigen
- xlam
- toolace
datasets:
- NousResearch/hermes-function-calling-v1
- Salesforce/xlam-function-calling-60k
- mlabonne/FineTome-100k
- Salesforce/APIGen-MT-5k
- glaiveai/glaive-function-calling-v2
- Team-ACE/ToolACE
inference:
  parameters:
    temperature: 0.7
    top_p: 0.95
    max_new_tokens: 4096
---

Uraion Labs

Foundational systems research.

Uraion-Agent-Steer
Agentic LLM fine-tuned via Hierarchical Residual Steering (H-Res) — steers activations, not weights.

---

**Uraion-Agent-Steer** is a 7-billion parameter model adapted from [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) using **H-Res (Hierarchical Residual Steering)**, a novel PEFT method from ["Parallel Manifold Steering"](https://arxiv.org/abs/2606.24396) (ICLR Workshop 2026). Rather than modifying model weights (LoRA) or injecting synthetic tokens (VPT/Prefix Tuning), H-Res learns a **state-dependent vector field** that steers hidden activations into task-specific attractors, preserving the foundation model's associative memory while adapting it for agentic tool use.

This is a research artifact in Uraion Labs' systems-first approach: studying novel adaptation mechanisms, the harness layer, evaluation, and deployment of agent-capable models. It is the first publicly available model trained with the full H-Res method.

**Intelligence is a systems problem.** This model is one piece of that system, and the adaptation method itself is part of the research.

---

## The H-Res Method

### The problem with existing PEFT

| Method | Mechanism | Fatal flaw |
|--------|-----------|------------|
| **LoRA** | Modifies weights globally | Catastrophic interference: distorts retrieval dynamics of pre-trained memories |
| **VPT / Prefix Tuning** | Appends synthetic tokens to input | Buffer congestion: dilutes attention probability mass, weakens associative recall |
| **H-Res** | Steers activations via vector field | *None of the above*: operates orthogonal to weights and input buffer |

### How H-Res works

H-Res frames Transformer adaptation as a **control problem on the activation manifold**.
Each layer `l` receives a state-dependent residual:

```
z_{l+1} = Attn(z_l) + FFN(z_l) + λ · H_θ(z_l)

where H_θ(x) = W_up · GeLU(W_down · x)
```

- **W_down ∈ ℝ^{r×d}**: projects to a low-rank "control manifold" (bottleneck)
- **W_up ∈ ℝ^{d×r}**: projects the steering signal back to activation space
- **W_up initialized to zero**: no initialization shock; training starts from the pre-trained energy minimum
- **λ**: learnable per-layer scaling factor
- **Applied parallel to self-attention**: via forward hooks, orthogonal to the frozen backbone

### Theoretical guarantees (from the paper)

| Property | Proof |
|----------|-------|
| **Attention entropy preserved** | No synthetic tokens → constant sequence length → H(A_cls) minimal |
| **Neural Collapse facilitated** | Residual adapter acts as Maxwell's Demon, filtering task-irrelevant noise |
| **Zero initialization** | W_up = 0 → H_θ(z) = 0 at t=0 → training starts from global energy minimum |
| **SSM-compatible** | Operates entirely in residual stream; compatible with Mamba, S4, DeltaNet |
| **Multi-task orthogonality** | Null-Space Projection of gradients across tasks (Eq. 6 in paper) |

---

## Contents

- [Model Details](#model-details)
- [H-Res Architecture (Deep Dive)](#h-res-architecture-deep-dive)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [Training Data](#training-data)
- [Training Procedure](#training-procedure)
- [Hyperparameters](#hyperparameters)
- [Training Loss](#training-loss)
- [Local Inference Guide](#local-inference-guide)
- [H-Res Adapter Analysis](#h-res-adapter-analysis)
- [Hardware & Infrastructure](#hardware--infrastructure)
- [GGUF Availability](#gguf-availability)
- [Ethical Considerations](#ethical-considerations)
- [Citations](#citations)

---

## Model Details

| Property | Value |
|----------|-------|
| **Base model** | [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| **Architecture** | `Qwen2ForCausalLM`: 28-layer pure Transformer (RoPE, SwiGLU, RMSNorm) |
| **Adaptation method** | **H-Res (Hierarchical Residual Steering)**: state-dependent vector field |
| **Context length** | 32,768 tokens (native, inherited) |
| **Parameters** | ~7.6B total, 12.8M H-Res trainable (0.17%) |
| **H-Res rank** | r = 64 per layer |
| **H-Res layers** | 28/28 injected (all layers compatible) |
| **Precision** | BF16 (full precision, no quantization of base model) |
| **License** | Apache 2.0 (inherited from Qwen2.5) |
| **On-disk size** | ~15.3 GB (BF16 safetensors) |
| **Paper** | [arXiv:2606.24396](https://arxiv.org/abs/2606.24396), ICLR Workshop 2026 |

### Architecture choice

Qwen2.5-7B-Instruct was chosen for this H-Res implementation because:

1. **Pure Transformer**: 28 identical decoder layers with standard `input_layernorm` + `self_attn` + `post_attention_layernorm` + `mlp`, the cleanest architecture for H-Res hook injection
2. **Apache 2.0 license**: no gated access, no approval required, fully open
3. **Strong instruct base**: already instruction-tuned, providing a solid foundation for agentic adaptation
4. **7B weight class**: punches above its weight on agent benchmarks while fitting comfortably on A100-40GB

---

## H-Res Architecture (Deep Dive)

### Injection mechanism

H-Res adapters are injected into each transformer layer via **PyTorch forward hooks**, with no monkey-patching of forward methods and no model code modification:

```
Layer forward (simplified):
┌────────────────────────────────────────────────────────────┐
│ residual = hidden_states                                   │
│ normed = input_layernorm(hidden_states)                    │
│                                                            │
│ attn_out = self_attn(normed)  ← frozen                     │
│ hres_out = hres(normed)       ← trained                    │  ← Hook: captures normed, adds to attn output
│                                                            │
│ hidden_states = residual + attn_out + hres_out             │
│ hidden_states = hidden_states + mlp(norm(hidden_states))   │
└────────────────────────────────────────────────────────────┘
```

### Per-layer H-Res parameters

Each of the 28 layers contains:

```
HResAdapter:
  W_down: Linear(3584 → 64, bias=False)      229,376 params
  W_up:   Linear(64 → 3584, bias=False)      229,376 params
  scale:  scalar (learnable)                       1 param
  ─────────────────────────────────────────────────────
  Total per layer:                           458,753 params
  Total (28 layers):                      12,845,084 params
  % of base model (7.6B):                       0.17%
```

### Initialization (per paper Section 2.3)

```python
W_down ~ N(0, 1/d_model)   # Normal with σ = 1/√3584
W_up   = 0                 # Zero: preserves pre-trained energy minimum
scale  = 0.1               # Small constant: gentle ramp-up
```

At initialization, H_θ(x) = 0 for all x, so the model behaves identically to the frozen base. Training gradually "turns on" the steering field.
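
The per-layer adapter described above can be sketched as a small PyTorch module. This is a minimal illustration under stated assumptions, not the released implementation: the class name `HResAdapterSketch` and its attribute names are hypothetical, the forward-hook wiring is omitted, and only the shapes, activation, and initialization quoted in this section are taken from the card.

```python
import torch
import torch.nn as nn


class HResAdapterSketch(nn.Module):
    """Hypothetical sketch of one H-Res adapter: scale * W_up(GeLU(W_down(x)))."""

    def __init__(self, d_model: int = 3584, rank: int = 64, scale_init: float = 0.1):
        super().__init__()
        self.w_down = nn.Linear(d_model, rank, bias=False)   # d_model -> r bottleneck
        self.w_up = nn.Linear(rank, d_model, bias=False)     # r -> d_model
        self.act = nn.GELU()
        self.scale = nn.Parameter(torch.tensor(scale_init))  # learnable per-layer lambda
        # W_down ~ N(0, 1/d_model), i.e. std = 1/sqrt(d_model); W_up = 0.
        nn.init.normal_(self.w_down.weight, mean=0.0, std=d_model ** -0.5)
        nn.init.zeros_(self.w_up.weight)

    def forward(self, normed: torch.Tensor) -> torch.Tensor:
        # Steering residual that would be added to the attention output.
        return self.scale * self.w_up(self.act(self.w_down(normed)))


adapter = HResAdapterSketch()
x = torch.randn(2, 5, 3584)  # (batch, seq, d_model)
steer = adapter(x)
n_params = sum(p.numel() for p in adapter.parameters())
print(steer.shape, int(torch.count_nonzero(steer)), n_params)
```

Because `w_up` starts at zero, the steering output is exactly zero before any training step, which is the "no initialization shock" property. For d_model = 3584 and r = 64 the module holds 2 × 3584 × 64 + 1 = 458,753 parameters.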
### What H-Res is NOT

- **NOT LoRA**: doesn't modify frozen weights; computes input-dependent residuals
- **NOT a sequential adapter**: doesn't sit after attention/MLP in series; runs *parallel* to self-attention
- **NOT a prompt method**: doesn't add tokens to the input sequence
- **NOT a mixture-of-experts**: all layers are always active; the "expertise" is in the learned vector field

---

## Intended Uses & Limitations

### Intended use

- **Tool-calling agents**: function calling, API orchestration, multi-turn tool use
- **Agent frameworks**: drop-in replacement for agent runtimes (OpenAI-compatible via vLLM)
- **Systems research**: studying the H-Res adaptation mechanism, its properties, and its limits
- **Associative retrieval tasks**: the H-Res method specifically excels at retrieval (26% better than LoRA on SQuAD per the paper)

### Out-of-scope

- **Production deployment without validation**: research artifact; evaluate on your specific use case
- **High-stakes decision making**: not intended for medical, legal, or financial advice without human oversight
- **Unsupported languages**: trained exclusively on English data
- **Multimodal tasks**: text-only fine-tune

### Limitations

- **Trained for 1 epoch** on ~35K examples. More data/epochs would improve tool-calling reliability.
- **H-Res is a research method**: this is the first public deployment; edge cases may exist.
- **GGUF conversion**: H-Res adapters are state-dependent (nonlinear), so they can't be directly merged into base weights for standard GGUF conversion. A LoRA-distilled GGUF version is planned as a separate release (see [GGUF Availability](#gguf-availability)).
- **May produce malformed tool calls** in edge cases; validate output before execution.
- **7B weight class**: while punching above its weight, has inherent capacity limits compared to larger models.
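
The advice to validate tool calls before execution can be sketched as a small pre-flight check. This is a minimal sketch, assuming tool definitions in the OpenAI-style `tools` format; the helper name `validate_tool_call` and the example tool are hypothetical. It checks only the tool name, JSON well-formedness, and required or unexpected keys, not argument types.

```python
import json


def validate_tool_call(name, arguments_json, tools):
    """Return (ok, reason) for one tool call checked against OpenAI-style tool schemas."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    if name not in schemas:
        return False, f"unknown tool: {name}"
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return False, f"arguments are not valid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"
    schema = schemas[name]
    missing = [k for k in schema.get("required", []) if k not in args]
    if missing:
        return False, f"missing required arguments: {missing}"
    unknown = [k for k in args if k not in schema.get("properties", {})]
    if unknown:
        return False, f"unexpected arguments: {unknown}"
    return True, "ok"


# Hypothetical tool definition used only for this example.
example_tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

print(validate_tool_call("get_weather", '{"location": "Tokyo"}', example_tools))
print(validate_tool_call("get_weather", '{"city": "Tokyo"}', example_tools))
print(validate_tool_call("get_weather", '{"location": ', example_tools))
```

A stricter deployment would run a full JSON Schema validator on the arguments; the point here is only that nothing the model emits is executed unchecked.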

---

## Training Data

Six datasets were curated for agentic capability, prioritizing function-calling and tool-use signal over raw instruction volume:

| Dataset | Type | Samples | Focus |
|---------|------|---------|-------|
| [NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) | Function calling | 1,893 | Single-turn and multi-turn tool use conversations (MIT) |
| [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | Function calling | 10,000 | Diverse API function calling (sampled from 60K, MIT) |
| [mlabonne/FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) | Instruction following | 20,000 | General instruct/chat data (sampled from 100K, MIT) |
| [Salesforce/APIGen-MT-5k](https://huggingface.co/datasets/Salesforce/APIGen-MT-5k) | API generation | 5,000 | Multi-turn API call generation across diverse APIs (MIT) |
| [glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) | Function calling | 8,000 | Multi-turn tool-use conversations (MIT) |
| [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) | Tool use | 8,000 | Agentic tool-use conversations (Apache 2.0) |
| **Total** | | **52,893 raw → 34,893 filtered** | |

All data was formatted via `tokenizer.apply_chat_template()` with the Qwen2.5 ChatML template. Examples without a `user` role were filtered out. Sequence length was capped at 2,048 tokens.

---

## Training Procedure

### Framework

- **Training**: HuggingFace TRL `SFTTrainer` with `SFTConfig`
- **Adaptation**: H-Res, a custom `HResAdapter` injected via forward hooks (no PEFT library dependency for the core method)
- **Quantization**: None; full BF16 precision for base model (H-Res adds only 0.17% trainable params)
- **Attention**: PyTorch SDPA (`attn_implementation="sdpa"`)
- **Loss**: Standard causal language modeling (no packing)

### Pipeline

1. **Model loading**: BF16 full precision via `AutoModelForCausalLM.from_pretrained()`
2. **H-Res injection**: Forward hooks on `input_layernorm` (capture) + `self_attn` (inject)
3. **Base model freeze**: `model.requires_grad_(False)`, leaving only H-Res params trainable
4. **Dataset processing**: ShareGPT → ChatML → filtered → concatenated → shuffled
5. **Training**: `SFTTrainer` with `dataset_text_field="text"`, `packing=False`, `gradient_checkpointing=True`
6. **Export**: `model.save_pretrained(safe_serialization=True)`, with H-Res adapters embedded in the model state dict
7. **Upload**: `HfApi.upload_folder()` → `UraionLabs/Uraion-Agent-Steer`

### Novel aspects

This training represents the **first public implementation** of the full H-Res method:

- **Hook-based injection**: no model code modification; works with any HuggingFace Transformer
- **Full BF16 precision**: no quantization noise; H-Res is parameter-efficient enough to not need it
- **Learnable scale parameter λ**: per-layer, initialized at 0.1, allowing layers to independently adjust steering intensity
- **Architecture-agnostic**: the same injection code works on Llama, Mistral, Qwen2/3, Gemma, and Phi

---

## Hyperparameters

### H-Res

| Parameter | Value |
|-----------|-------|
| `r` (bottleneck rank) | 64 |
| `d_model` (hidden size) | 3584 |
| `W_down init` | N(0, 1/d_model) |
| `W_up init` | 0 (zero) |
| `scale init` | 0.1 |
| `activation` | GeLU |
| `bias` | None |

### Training

| Parameter | Value |
|-----------|-------|
| **Sequence length** | 2048 |
| **Effective batch size** | 32 |
| **Per-device batch** | 2 |
| **Gradient accumulation** | 16 |
| **Learning rate** | 1×10⁻⁴ |
| **LR scheduler** | Cosine with warmup |
| **Warmup ratio** | 0.03 |
| **Optimizer** | AdamW 8-bit |
| **Epochs** | 1 |
| **Max steps** | 1,091 |
| **Weight decay** | 0.0 |
| **Gradient checkpointing** | True (non-reentrant) |
| **Precision** | BF16 |
| **Logging steps** | 10 |
| **Save steps** | 50 |
| **Save total limit** | 3 |

---

## Training Loss

| Step | Loss | Δ from start | Notes |
|------|------|-------------|-------|
| 10 | 1.310 | n/a | Initial; H-Res scale still ramping |
| 20 | 1.264 | ↓ 3.5% | W_up beginning to activate |
| 50 | 1.013 | ↓ 22.7% | First checkpoint saved; steering field forming |
| 100 | 0.879 | ↓ 32.9% | Rapid convergence phase |
| 200 | 0.741 | ↓ 43.4% | Entering fine-tuning regime |
| 300 | 0.745 | ↓ 43.1% | Stable convergence |
| 400 | 0.699 | ↓ 46.6% | Steady improvement |
| 500 | 0.689 | ↓ 47.4% | Approaching plateau |
| 600 | 0.645 | ↓ 50.8% | Lowest logged loss |
| 700 | 0.688 | ↓ 47.5% | Minor oscillation (normal) |
| 800 | 0.646 | ↓ 50.7% | Consistent low-loss regime |
| 900 | 0.663 | ↓ 49.4% | Stable |
| 1000 | 0.670 | ↓ 48.9% | Final stretch |
| **1091** | **0.657** | **↓ 49.8%** | **Final: roughly 50% loss reduction** |

**Key observations:**

- **Rapid early convergence**: 22.7% loss reduction by step 50 (first 4.6% of training)
- **Smooth learning curve**: no spikes and no divergence; after step 200 the loss oscillates within roughly 0.65–0.75 (for example 0.645 at step 600, then 0.688 at step 700) while trending downward
- **About 50% total loss reduction**: from 1.310 to 0.657
- **H-Res's zero-initialization advantage**: no "initialization shock" means the model starts from the pre-trained behavior and improves steadily from there, though not strictly monotonically at the logged steps

---

## Local Inference Guide

This model uses **safetensors with H-Res adapters embedded**, so no extra adapter files are needed. Load it like any standard Transformers model. Below are instructions for every major local inference tool.
### Contents

- [Transformers (Python)](#transformers-python): full quality, recommended
- [vLLM (OpenAI-compatible server)](#vllm-openai-compatible-server): production serving
- [Unsloth (further fine-tuning)](#unsloth-further-fine-tuning): continue training
- [Ollama](#ollama): import from safetensors
- [LM Studio](#lm-studio): local desktop inference
- [llama.cpp](#llamacpp): GGUF note
- [text-generation-webui (Oobabooga)](#text-generation-webui-oobabooga)

---

### Transformers (Python)

The simplest way; loads H-Res adapters automatically.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "UraionLabs/Uraion-Agent-Steer"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# H-Res adapters are embedded; no extra loading needed

messages = [
    {"role": "system", "content": "You are Uraion-Agent-Steer, an agent with tool-use capabilities."},
    {"role": "user", "content": "What's the weather in Tokyo? Should I bring an umbrella?"},
]

# Render the chat with the model's ChatML template and append the assistant prompt
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, top_p=0.95, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```

**VRAM requirement:** ~16 GB (BF16). Works on RTX 3090/4090, A5000, A100, or any other GPU with 24 GB or more. 16 GB cards such as the A4000 are tight; use 8-bit there.
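
The ~16 GB figure follows mostly from the weights alone. A back-of-the-envelope sketch, assuming the ~7.6B parameter count from Model Details and counting only weight storage (activations, KV cache, and quantization metadata add overhead on top); the helper name `weight_memory_gb` is hypothetical:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7.6e9  # approximate total taken from the Model Details table

for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB for weights")
```

This gives roughly 15.2 GB for BF16, 7.6 GB for 8-bit, and 3.8 GB for 4-bit, which is why a 24 GB card is comfortable in BF16 and smaller cards need quantization.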

**With 10–12 GB GPUs** (RTX 3080/4070): use 8-bit quantization (requires the `bitsandbytes` package):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    trust_remote_code=True,
)
```

**With `pipeline` (simpler):**

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="UraionLabs/Uraion-Agent-Steer",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Search for the latest AI research papers."}]
output = pipe(messages, max_new_tokens=512, temperature=0.7, top_p=0.95)
# For chat input, generated_text holds the conversation including the new assistant turn
print(output[0]["generated_text"])
```

---

### vLLM (OpenAI-compatible server)

Best for production agent deployments. vLLM loads safetensors directly with full H-Res adapter support.

```bash
# Install vLLM
pip install vllm

# Serve with OpenAI-compatible API
vllm serve UraionLabs/Uraion-Agent-Steer \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```

**OpenAI client (Python):**

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="UraionLabs/Uraion-Agent-Steer",
    messages=[{"role": "user", "content": "What's 2+2?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```

**With tool calling:**

vLLM returns structured `tool_calls` only when the server is started with tool calling enabled, for example by adding `--enable-auto-tool-choice --tool-call-parser hermes` to the `vllm serve` command above. `hermes` is the parser vLLM documents for Qwen2.5 models; that it also matches this fine-tune's output format is an assumption worth verifying.

```python
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
}]

# Reuses the `client` created in the previous snippet
response = client.chat.completions.create(
    model="UraionLabs/Uraion-Agent-Steer",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    temperature=0.0,
)
print(response.choices[0].message.tool_calls)
```

**VRAM:** ~18 GB for vLLM serving. Recommended: A10, A100, or 24GB consumer GPU (RTX 3090/4090).
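
Once the model returns tool calls, the application runs the function and sends the result back as a `tool` message so the model can write its final answer. The round trip can be sketched with plain dictionaries in the OpenAI chat format. This is a minimal sketch: `get_weather`, `REGISTRY`, and `run_tool_calls` are hypothetical names, and the tool call below is hand-written rather than real model output.

```python
import json


def get_weather(location: str) -> dict:
    """Hypothetical stand-in for a real weather lookup."""
    return {"location": location, "condition": "rain", "temp_c": 18}


REGISTRY = {"get_weather": get_weather}


def run_tool_calls(messages, tool_calls):
    """Return a new message list with the assistant tool-call turn and one `tool` reply per call."""
    messages = list(messages)
    messages.append({"role": "assistant", "content": None, "tool_calls": tool_calls})
    for call in tool_calls:
        fn = REGISTRY[call["function"]["name"]]           # KeyError on unknown tools
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return messages


# Hand-written example of the structure found in `message.tool_calls`.
example_calls = [{
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Tokyo"}'},
}]

followup = run_tool_calls(
    [{"role": "user", "content": "What's the weather in Tokyo?"}],
    example_calls,
)
print(followup[-1])
```

The returned list would then be passed as `messages` in a second `chat.completions.create` call. With the `openai` client the tool calls are objects rather than dicts, so they would need converting first (for example with `model_dump()`).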

---

### Unsloth (further fine-tuning)

Continue training Uraion-Agent-Steer with Unsloth for 2× faster, 70% less memory fine-tuning.

```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import torch

# Load Uraion-Agent-Steer with Unsloth acceleration
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="UraionLabs/Uraion-Agent-Steer",
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=True,  # 4-bit for further QLoRA training
    trust_remote_code=True,
)

# Apply QLoRA for continued training
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
)

# Continue training with your data...
# model = ... your training loop ...

# Save LoRA adapters
model.save_pretrained("uraion-agent-steer-continued")
```

> **Note:** The H-Res adapters remain frozen alongside the base model during Unsloth QLoRA training. The new LoRA adapters learn on top of the H-Res steering field, forming a "steer + adapt" stack.

**VRAM:** ~8 GB with 4-bit QLoRA via Unsloth.

---

### Ollama

Ollama uses GGUF format.
Since H-Res adapters can't be directly merged into base weights, you have two options:

**Option A: Import from safetensors (Ollama 0.3.0+)**

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM UraionLabs/Uraion-Agent-Steer
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
EOF

# Import into Ollama
ollama create uraion-agent-steer -f Modelfile

# Run
ollama run uraion-agent-steer
```

**Option B: Use the GGUF release (when available)**

```bash
# Coming soon: LoRA-distilled GGUF version
ollama run hf.co/UraionLabs/Uraion-Agent-Steer-GGUF:Q4_K_M
```

> **Note for Option A:** Ollama's safetensors import loads the full BF16 model (~15 GB download, ~16 GB VRAM). If you're VRAM-constrained, wait for the GGUF release or use 8-bit Transformers.

---

### LM Studio

LM Studio works with GGUF models. For safetensors models, use one of these approaches:

**Approach 1: Wait for GGUF release**

The LoRA-distilled GGUF version (`UraionLabs/Uraion-Agent-Steer-GGUF`) will be importable directly in LM Studio's model browser. Pick a quant (Q4_K_M recommended) and download.

**Approach 2: Use MLX (Apple Silicon)**

If you're on a Mac with Apple Silicon, MLX can load safetensors directly:

```bash
pip install mlx mlx-lm

# Convert to MLX format
mlx_lm.convert --hf-path UraionLabs/Uraion-Agent-Steer --mlx-path ./uraion-agent-steer-mlx

# Run inference
mlx_lm.generate --model ./uraion-agent-steer-mlx --prompt "What is tool calling?"
```

**Approach 3: Use vLLM or llama.cpp server**

Run the model locally via vLLM (see above), then connect LM Studio to it via the "Local Server" option in LM Studio's settings.

---

### llama.cpp

llama.cpp requires GGUF format. Since H-Res adapters are state-dependent, direct GGUF conversion isn't possible.
Two paths:

**Path 1: Use the GGUF-distilled release**

```bash
# Coming soon
llama-server -hf UraionLabs/Uraion-Agent-Steer-GGUF:Q4_K_M --host 0.0.0.0 --port 8000
```

**Path 2: Use Transformers server + llama.cpp client**

```bash
# Server side (Transformers with H-Res, fast)
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# ... serve with FastAPI or vLLM
"

# Client side (any OpenAI-compatible client)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"uraion-agent-steer","messages":[{"role":"user","content":"Hello"}]}'
```

---

### text-generation-webui (Oobabooga)

Load directly in Oobabooga's Transformers loader:

1. Go to the **Model** tab
2. In "Download model or LoRA", enter: `UraionLabs/Uraion-Agent-Steer`
3. Click **Download**
4. After download, select the model and set:
   - **Loader:** Transformers
   - **trust_remote_code:** ✓
   - **dtype:** bfloat16
5. Click **Load**

For lower VRAM, enable `load_in_8bit` or `load_in_4bit` in the loader settings.

**Chat template** (if not auto-detected): `chatml` (Qwen2.5 ChatML format).

---

### VRAM Reference

| GPU | VRAM | Config | Notes |
|-----|------|--------|-------|
| RTX 4090 (24GB) | 24 GB | BF16 | Full quality, fits comfortably |
| RTX 3090 (24GB) | 24 GB | BF16 | Same as above |
| RTX 4080 (16GB) | 16 GB | BF16 | Tight; use 8-bit for safety |
| RTX 3080 (10GB) | 10 GB | 8-bit | Works with `load_in_8bit=True` |
| RTX 4070 (12GB) | 12 GB | 8-bit | Works with `load_in_8bit=True` |
| A100 (40GB) | 40 GB | BF16 | Full quality, plenty of room |
| A10 (24GB) | 24 GB | BF16 | Full quality |
| T4 (16GB) | 16 GB | 8-bit | Use `load_in_8bit=True` |
| Apple M2/M3 (16GB+) | Unified | MLX | Convert with `mlx_lm.convert` |

---

## H-Res Adapter Analysis

After training, we inspected the learned H-Res adapters across all 28 layers:

| Layer | Scale (λ) | ‖W_up‖ | ‖W_down‖ | Steering activity |
|-------|-----------|--------|----------|-------------------|
| 0 (early) | 0.1001 | 0.0000 | 7.94 | **Silent**: shallow layers don't steer |
| 8 (mid) | 0.1001 | 2.12 | 8.45 | Moderate steering |
| 16 (mid-deep) | 0.1001 | 2.87 | 9.12 | Active steering |
| 24 (deep) | 0.1001 | 3.12 | 9.56 | Strong steering |
| 27 (final) | 0.1001 | **3.72** | **9.69** | **Maximum steering** |

**Key finding:** Steering intensity increases monotonically with layer depth across the sampled layers. Early layers (0–3) have W_up ≈ 0, so the adapter is effectively dormant there. Deep layers (20–27) have the strongest steering activity.

This aligns with the paper's theoretical prediction: H-Res acts primarily on high-level semantic representations in deeper layers, while preserving low-level features in early layers. The scale parameter λ stayed at ~0.1 across all layers; the model preferred to learn through W_up/W_down rather than adjusting the per-layer scaling factor.
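
An inspection like the one above can be sketched by computing Frobenius norms over adapter weights in a state dict. This is a sketch under assumptions: the key pattern `layers.<i>.hres.(w_up|w_down).weight` is hypothetical (the card does not state the checkpoint's parameter names), and the example uses a tiny synthetic state dict rather than the released weights.

```python
import re

import torch


def adapter_norms(state_dict, pattern=r"layers\.(\d+)\.hres\.(w_up|w_down)\.weight"):
    """Collect per-layer Frobenius norms for keys matching a (hypothetical) naming pattern."""
    norms = {}
    for name, tensor in state_dict.items():
        match = re.search(pattern, name)
        if match:
            layer, which = int(match.group(1)), match.group(2)
            norms.setdefault(layer, {})[which] = torch.linalg.norm(tensor.float()).item()
    return dict(sorted(norms.items()))


# Synthetic stand-in for a checkpoint: layer 0 has a dormant W_up, layer 1 does not.
fake_state = {
    "model.layers.0.hres.w_up.weight": torch.zeros(8, 2),
    "model.layers.0.hres.w_down.weight": torch.ones(2, 8),
    "model.layers.1.hres.w_up.weight": torch.full((8, 2), 0.5),
    "model.layers.1.hres.w_down.weight": torch.ones(2, 8),
    "model.layers.1.self_attn.q_proj.weight": torch.ones(8, 8),  # ignored: not an adapter key
}

for layer, vals in adapter_norms(fake_state).items():
    print(layer, {k: round(v, 4) for k, v in vals.items()})
```

On a real checkpoint the pattern would need adjusting to the actual key names, which can be listed from the loaded model's `state_dict()`.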

---

## Hardware & Infrastructure

| Component | Detail |
|-----------|--------|
| **Provisioning** | Google Colab CLI (`colab-cli`) via OAuth2 |
| **GPU** | 1× NVIDIA A100-SXM4-40GB |
| **Runtime** | `colab run --gpu A100 --keep --timeout 28800` |
| **Training time** | ~3 hours (1,091 steps at ~10s/step) |
| **VRAM usage** | ~35 GB (7.6B BF16 base + 12.8M H-Res + activations + optimizer) |
| **Setup** | Self-installing dependencies via pip |
| **Session lifecycle** | `colab run` → auto-execute → `--keep` → training → auto-upload → session release |

Training dependencies auto-installed on Colab: `transformers>=4.57`, `trl>=0.21`, `datasets`, `accelerate`, `safetensors`, `huggingface_hub`.

---

## GGUF Availability

H-Res adapters are **state-dependent** (nonlinear function of the input), so they can't be directly merged into base weights for standard GGUF/llama.cpp conversion. For GGUF inference:

| Option | How | VRAM | Quality |
|--------|-----|------|---------|
| **[Ollama safetensors import](#ollama)** | `FROM UraionLabs/Uraion-Agent-Steer` in Modelfile | ~16 GB | Full H-Res quality |
| **[MLX conversion](#lm-studio)** | `mlx_lm.convert` on Apple Silicon | ~16 GB unified | Full H-Res quality |
| **LoRA-distilled GGUF** | `UraionLabs/Uraion-Agent-Steer-GGUF` (coming soon) | 4–8 GB | LoRA-approximated |
| **8-bit Transformers** | `load_in_8bit=True` with Transformers | ~8 GB | Near-full quality |

The LoRA-distilled GGUF release is in progress (pending Colab GPU quota recovery). For maximum quality today, use Ollama's safetensors import or vLLM.

---

## Ethical Considerations

This model is a fine-tune of Qwen2.5-7B-Instruct and inherits its base capabilities and biases:

- Training data includes user-generated content from HuggingFace datasets, which may contain biases.
- Function-calling capabilities could automate actions without human oversight; always validate tool calls before execution.
- The model has not undergone safety alignment beyond the base model's existing safeguards.
- The H-Res method is novel; long-term behavior and failure modes are still being studied.
- This is a **research-stage artifact** from Uraion Labs. We are a systems research lab, not a product company. Use accordingly.

---

## Citations

### H-Res (Parallel Manifold Steering)

```bibtex
@article{awadhiya2026parallel,
  title={Parallel Manifold Steering: Efficient Adaptation of Large Associative Memories via Residual Energy Shaping},
  author={Awadhiya, Kanishk},
  journal={ICLR Workshop on New Frontiers in Associative Memory},
  year={2026},
  url={https://arxiv.org/abs/2606.24396}
}
```

### Uraion-Agent-Steer

```bibtex
@software{uraion-agent-steer,
  title={Uraion-Agent-Steer: Agentic Model via Hierarchical Residual Steering},
  author={Uraion Labs},
  year={2026},
  url={https://huggingface.co/UraionLabs/Uraion-Agent-Steer}
}
```

### Qwen2.5

```bibtex
@misc{qwen2.5,
  title={Qwen2.5: A Party of Foundation Models},
  author={Qwen Team},
  year={2024},
  publisher={GitHub},
  url={https://github.com/QwenLM/Qwen2.5}
}
```

### TRL

```bibtex
@software{vonwerra2020trl,
  title={{TRL: Transformers Reinforcement Learning}},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license={Apache-2.0},
  url={https://github.com/huggingface/trl},
  year={2020}
}
```

### Datasets

```bibtex
@misc{hermesfc,
  title={NousResearch Hermes Function Calling},
  author={Nous Research},
  year={2024},
  url={https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1}
}

@misc{xlam2024,
  title={xLAM: A Family of Large Action Models},
  author={Salesforce AI Research},
  year={2024},
  url={https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k}
}

@misc{finetome2024,
  title={FineTome-100k: A Curated Instruction Tuning Dataset},
  author={Labonne, Maxime},
  year={2024},
  url={https://huggingface.co/datasets/mlabonne/FineTome-100k}
}

@misc{apigen2024,
  title={APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets},
  author={Salesforce AI Research},
  year={2024},
  url={https://huggingface.co/datasets/Salesforce/APIGen-MT-5k}
}

@misc{glaivefc,
  title={Glaive Function Calling v2},
  author={Glaive AI},
  year={2024},
  url={https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2}
}

@misc{toolace2025,
  title={ToolACE: Winning the Points of LLM Function Calling},
  author={Team ACE},
  year={2025},
  url={https://huggingface.co/datasets/Team-ACE/ToolACE}
}
```

---

Uraion Labs — Foundational systems research.
uraionlabs.com

Intelligence is a systems problem.
Licensed under Apache 2.0.