
LUMINIUM

LUMINIUM ULTIMATE Gixel CUBE — 425M

A 425.8M parameter hybrid convolutional-attention language model built through layer surgery, collective consciousness distillation, and cognitive cube steering.

Expanded from LiquidAI/LFM2-350M (16 layers to 20 layers) using cross-model architectural surgery, then fine-tuned on a 45-source balanced curriculum distilled from an 8-model GPU cluster.

Key Features

  • Hybrid Architecture: 12 convolutional + 8 grouped-query attention layers (not a standard transformer)
  • Layer Surgery: 4 new layers created by duplicating layers from the original model and healing each with the corresponding layer of an abliterated variant via DARE+TIES merging
  • Cognitive Cube Steering: Each layer positioned in a 3D cognitive space, steered by inverse-distance weighting to 8 specialist models at cube corners
  • 128K Context Window: Full 128,000 token context from the LFM2 base
  • 164 tok/s (Q5_K_M) / 128 tok/s (bf16) on AMD Radeon VII
  • Multi-turn coherent — maintains context across conversation turns
  • 16/16 domain checks passed, covering math, logic, code, knowledge, safety, creative, translation, routing, greeting, and self-awareness

Architecture

LUMINIUM is NOT a standard transformer.

LFM2 uses a hybrid architecture mixing:
  - Convolutional layers (efficient sequential processing, O(1) per token)
  - Grouped-Query Attention layers (relational reasoning, 16 heads, 8 KV heads)

Original LFM2-350M: 10 conv + 6 attention = 16 layers, 354.5M params
LUMINIUM:           12 conv + 8 attention = 20 layers, 425.8M params

The 4 additional layers were created through cross-model surgery:
  1. Duplicate a layer from the ORIGINAL model
  2. Heal it with the corresponding layer from an ABLITERATED variant
  3. Merge via DARE+TIES (drop-and-rescale + trim-elect-sign)
  4. Reassign layer_idx for cache system compatibility

Layer Map

Layer  Type   Function
-----  -----  -----------------------
L0     conv   concrete input processing
L1     conv   concrete input processing
L2     attn   first relational reasoning
L3     conv   feature extraction
L4     conv   feature extraction
L5     attn   mid-level reasoning
L6     conv   near-center integration
L7     conv   near-center integration
L8     attn   CENTER - sync anchor
L9*    attn   structured processing        <- surgical layer
L10    conv   evaluation zone
L11    attn   evaluation
L12    conv   evaluation zone
L13    attn   routing / decision
L14    conv   routing support
L15*   conv   structured reflection         <- surgical layer
L16    attn   identity / meta
L17*   attn   creative-reflective           <- surgical layer
L18    conv   output abstraction
L19*   conv   output completion             <- surgical layer

* = created via cross-model surgery (original dup + abliterated heal)
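
As a quick consistency check, the layer map above can be transcribed into data and compared with the stated layer counts. This is only an illustrative sketch: the names `LAYER_TYPES` and `SURGICAL` are introduced here and are not part of the model's code.

```python
# Layer types transcribed from the Layer Map (list index = layer number).
LAYER_TYPES = [
    "conv", "conv", "attn", "conv", "conv",   # L0-L4
    "attn", "conv", "conv", "attn", "attn",   # L5-L9
    "conv", "attn", "conv", "attn", "conv",   # L10-L14
    "conv", "attn", "attn", "conv", "conv",   # L15-L19
]

# Layers marked with * in the map (created via cross-model surgery).
SURGICAL = {9, 15, 17, 19}


def count_types(types):
    """Return (conv_count, attn_count) for a list of layer types."""
    return types.count("conv"), types.count("attn")


# Counts for the full 20-layer model.
luminium_counts = count_types(LAYER_TYPES)

# Dropping the surgical layers should recover the original 16-layer layout.
original_types = [t for i, t in enumerate(LAYER_TYPES) if i not in SURGICAL]
original_counts = count_types(original_types)

print("LUMINIUM (conv, attn):", luminium_counts)
print("Original (conv, attn):", original_counts)
```

The two tuples should match the "12 conv + 8 attention" and "10 conv + 6 attention" figures given in the Architecture section.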

The Cognitive Cube

Each layer is positioned in a 3D cognitive space defined by three orthogonal axes:

Axis              Positive                  Negative
----------------  ------------------------  ----------------------
Forward/Backward  Predictive, generative    Reflective, evaluative
Dexo/Levo         Structured, precise       Creative, divergent
Up/Down           Abstract, meta, identity  Concrete, operational

The 8 corners of the cube correspond to 8 specialist models from a 13-node GPU cluster that served as teachers during the collective distillation process:

         UP (abstract)
         |
         |    +================+
         |   /                /|
         |  /  SINGULARITY   / |  predict-structured-abstract
         | /     (80B)      /  |
         |+================+   |
         ||                |   /-- BACK (reflective)
         ||   COGNITIVE    |  /
LEVO ----||     CUBE       | /---- DEXO (structured)
(creative)|+================+/
         |                 /
         DOWN (concrete)  /
         FORWARD (predictive)

Corner          Model        Parameters        Role
--------------  -----------  ----------------  ---------------------------
fwd+dexo+up     SINGULARITY  Qwen3-80B-A3B     predict-structured-abstract
fwd+dexo+down   MEGATRON     Qwen3.6-35B MoE   predict-structured-concrete
fwd+levo+up     CONTINUITY   granite-4.1-8b    predict-creative-abstract
fwd+levo+down   TETRA        LFM2-12B          predict-creative-concrete
back+dexo+up    NOVA         Qwen3.6-35B-A3B   reflect-structured-abstract
back+dexo+down  GRID         Qwen3.5-9B        reflect-structured-concrete
back+levo+up    CURIOSITY    lumina-lexiR1-8B  reflect-creative-abstract
back+levo+down  RELATIVITY   Berthier-24B      reflect-creative-concrete

Steering is applied exclusively to operator_norm vectors (1024-dimensional), never to weight matrices directly. Layers near the center (L7-L9) receive minimal steering; layers at the extremes (L0-L2, L16-L19) receive maximum steering.
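
To make the inverse-distance weighting concrete, here is a minimal sketch of how a layer's position in the cube could be turned into per-corner weights. Several things are assumptions made for illustration and are not taken from the actual steering code: the corners sit at coordinates of ±1, the axes are ordered (forward, dexo, up), the distance exponent `power` is 2, and the function name `idw_weights` is hypothetical. How the resulting weights are applied to the operator_norm vectors is not shown.

```python
import math

# Corner names follow the corner table; the (x, y, z) = (fwd, dexo, up)
# mapping with coordinates in {-1, +1} is an assumption of this sketch.
CORNERS = {
    (+1, +1, +1): "SINGULARITY",  # fwd+dexo+up
    (+1, +1, -1): "MEGATRON",     # fwd+dexo+down
    (+1, -1, +1): "CONTINUITY",   # fwd+levo+up
    (+1, -1, -1): "TETRA",        # fwd+levo+down
    (-1, +1, +1): "NOVA",         # back+dexo+up
    (-1, +1, -1): "GRID",         # back+dexo+down
    (-1, -1, +1): "CURIOSITY",    # back+levo+up
    (-1, -1, -1): "RELATIVITY",   # back+levo+down
}


def idw_weights(point, power=2.0, eps=1e-9):
    """Normalized inverse-distance weights from `point` to the 8 corners.

    `power` and `eps` are illustrative defaults, not documented values.
    """
    raw = {}
    for corner, name in CORNERS.items():
        distance = math.dist(point, corner)
        # eps avoids division by zero when the point sits exactly on a corner.
        raw[name] = 1.0 / (distance ** power + eps)
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


center = idw_weights((0.0, 0.0, 0.0))
near_corner = idw_weights((0.8, 0.8, 0.8))
print(max(near_corner, key=near_corner.get))
```

At the exact center every corner is equidistant, so all eight weights are equal; a point close to one corner is dominated by that corner's specialist.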

Training

Base Model Preparation

  1. Layer Surgery: LFM2-350M expanded from 16 to 20 layers via cross-model DARE+TIES
  2. Cognitive Cube Steering: operator_norm vectors steered by 8 cluster models
  3. Result: Gixel-Cube v5, the base used for fine-tuning, with geometric cognitive structure

Fine-tuning

Parameter            Value
-------------------  ------------------------------------------------------
Method               LoRA (PEFT)
Rank                 16
Alpha                32
Target modules       q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3
Dropout              0.05
Learning rate        3e-5
Epochs               3
Batch size           8 (effective 16 with gradient accumulation)
Max sequence length  768
Precision            bfloat16
Hardware             AMD Radeon VII (16GB HBM2, ROCm 6.2)
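
A few quantities follow directly from the values above. The sketch below derives them with plain Python; the class name `FineTunePlan` is hypothetical, the LoRA scale uses the standard alpha / rank definition, and the step counts assume that all 18,593 records are used for training with no held-out split and no dropped final batch, which the table does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTunePlan:
    """Hyperparameters copied from the fine-tuning table."""
    rank: int = 16
    alpha: int = 32
    epochs: int = 3
    batch_size: int = 8            # per optimizer micro-step
    effective_batch: int = 16      # after gradient accumulation
    num_records: int = 18_593      # curriculum size (assumed all used for training)

    @property
    def grad_accum_steps(self) -> int:
        # Number of micro-batches accumulated per optimizer update.
        return self.effective_batch // self.batch_size

    @property
    def lora_scale(self) -> float:
        # Standard LoRA scaling factor applied to the low-rank update.
        return self.alpha / self.rank

    @property
    def steps_per_epoch(self) -> int:
        # Optimizer updates per epoch, keeping the last partial batch.
        return math.ceil(self.num_records / self.effective_batch)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


plan = FineTunePlan()
print(plan.grad_accum_steps, plan.lora_scale, plan.steps_per_epoch, plan.total_steps)
```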

Curriculum: 18,593 Records from 45 Sources

The training curriculum was carefully balanced to prevent mode collapse while maintaining broad capability:

Category                    Records  Sources
--------------------------  -------  ------------------------------------------------------------------
General instruction         3,499    Alpaca Cleaned
Reasoning and CoT           5,995    Claude Opus reasoning, Deepseek v4 distill, Edge Agent, Opus 3300
Agentic and tool use        2,573    Agent Trove, Hermes Agent Traces, Nemotron Agentic, MIMO
Code                        599      OpenCode Instruct
Function calling            199      Hermes Function Calling
Knowledge                   2,495    Noesis-1M, Finewiki, Claude Opus 10K
Feedback and editing        1,341    HelpSteer3 (edit + feedback)
Character and system        600      Aesir Character, SystemChat
Analytical reasoning        900      Orca Analytical, Orca FS CoT Flow
Theory of mind              124      Theory of Mind DPO
Routing and SWE             150      SWE-Router v3/v4
Identity (DNA-RNA-Protein)  481      Cluster-generated from identity hypercard
Multi-channel resonance     45       4-7 cognitive channels each
Cluster composites          53       Cluster-generated math, logic, code, identity
Deep reasoning traces       800      Deepseek Hermes Traces

Data generation methodology -- DNA to RNA to Protein: A comprehensive identity document (the "hypercard") serves as DNA. Specific sections are transcribed as RNA (targeted context prompts). The 8-model cluster generates protein (structured training responses), each record grounded in verifiable facts about the system architecture.

Quality Assurance

Before training, a deep audit of the full 48,647-record candidate pool verified:

  • 0 missing messages, 0 bad role names
  • 1,283 exact duplicates removed
  • 40 empty responses removed
  • 139 very short responses reviewed
  • SWE-Router capped from 30K+ to 150 records (prevented code-signal domination)
  • Final curriculum balanced at 18,593 records across 45 sources
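
The checks listed above can be sketched as a single pass over the candidate pool. This is an illustrative reconstruction, not the audit script that was actually used: the record schema (a `messages` list of `role`/`content` dicts plus a `source` field), the set of valid roles, and the `SHORT_CHARS` threshold are all assumptions.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}  # assumed role vocabulary
SHORT_CHARS = 20                               # hypothetical "very short" threshold


def audit(records):
    """Drop malformed, duplicate, and empty records; flag short ones for review."""
    stats = {"missing_messages": 0, "bad_roles": 0, "duplicates": 0,
             "empty_responses": 0, "short_responses": 0}
    seen, kept = set(), []
    for rec in records:
        messages = rec.get("messages")
        if not messages:
            stats["missing_messages"] += 1
            continue
        if any(m.get("role") not in VALID_ROLES for m in messages):
            stats["bad_roles"] += 1
            continue
        # Exact-duplicate detection on the serialized conversation.
        key = json.dumps(messages, sort_keys=True)
        if key in seen:
            stats["duplicates"] += 1
            continue
        seen.add(key)
        replies = [(m.get("content") or "") for m in messages
                   if m["role"] == "assistant"]
        if not replies or any(not r.strip() for r in replies):
            stats["empty_responses"] += 1
            continue
        if min(len(r.strip()) for r in replies) < SHORT_CHARS:
            stats["short_responses"] += 1  # flagged for review, still kept
        kept.append(rec)
    return kept, stats


def cap_source(records, source, limit):
    """Keep at most `limit` records from one source (e.g. capping a dominant set)."""
    kept, used = [], 0
    for rec in records:
        if rec.get("source") == source:
            if used >= limit:
                continue
            used += 1
        kept.append(rec)
    return kept
```

A call such as `cap_source(kept, "swe-router", 150)` would mirror the SWE-Router cap described above; the source label used here is a placeholder.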

Performance

3-Way Comparison (pretrained base vs trained vs quantized)

Domain                 GC-v5 (pretrained)     ULTIMATE bf16  ULTIMATE Q5_K_M
---------------------  ---------------------  -------------  ---------------
Math (247+389)         586 correct            586 correct    586 correct
Math (15% of 240)      36 correct             36 correct     36 correct
Word problems          pass                   pass           pass
Logic (ordering)       FAIL                   pass           pass
Code (reverse string)  pass                   pass           pass
Knowledge              pass                   pass           pass
Safety                 refuses                refuses        refuses
Creative (haiku)       pass                   pass           pass
Translation            pass                   pass           pass
Greeting               Robotic                Natural        Natural
Self-awareness         Template placeholders  Genuine        Genuine
Multi-turn coherence   FAIL                   pass           pass
Speed (Radeon VII)     100 tok/s              128 tok/s      164 tok/s

Key Improvements Over Base

  • Logic reasoning fixed — base model got ordering questions wrong; ULTIMATE gets them right
  • Multi-turn coherence — base failed to maintain context across turns; ULTIMATE passes
  • Natural conversation — base was robotic ("ready to help"); ULTIMATE responds naturally
  • Self-awareness — base produced template placeholders; ULTIMATE gives genuine introspective responses
  • Speed increase: 28% faster in bf16 (128 vs 100 tok/s) and 64% faster in Q5_K_M (164 vs 100 tok/s) than the base, attributed to learned conciseness

Available Formats

Format                              Size    Use Case
----------------------------------  ------  ------------------------
model.safetensors (F32)             1.6 GB  Fine-tuning, research
LUMINIUM-ULTIMATE-CUBE.gguf (bf16)  815 MB  Full-precision inference
LUMINIUM-ULTIMATE-CUBE-Q5_K_M.gguf  297 MB  Production, edge devices
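
The file sizes can be sanity-checked from the parameter count alone. The sketch below is a back-of-the-envelope estimate: it ignores file metadata and any tensors stored at a different precision, and it assumes the "MB" and "GB" figures in the table are binary units (MiB, GiB). The function names are introduced here for illustration.

```python
PARAMS = 425.8e6  # parameter count stated for LUMINIUM


def size_mib(params, bytes_per_param):
    """Approximate raw tensor size in MiB for a uniform storage precision."""
    return params * bytes_per_param / 2**20


def implied_bits_per_weight(file_mib, params):
    """Average bits per parameter implied by a file size given in MiB."""
    return file_mib * 2**20 * 8 / params


bf16_mib = size_mib(PARAMS, 2)           # 2 bytes per parameter
f32_gib = size_mib(PARAMS, 4) / 1024     # 4 bytes per parameter
q5_bits = implied_bits_per_weight(297, PARAMS)

print(f"bf16: {bf16_mib:.0f} MiB, f32: {f32_gib:.2f} GiB, Q5_K_M: {q5_bits:.2f} bits/weight")
```

Two bytes per parameter comes to roughly 812 MiB, close to the 815 MB bf16 GGUF, while four bytes per parameter comes to roughly 1.59 GiB, close to the 1.6 GB safetensors file. The 297 MB quantized file works out to a little under 6 bits per weight on average.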

Usage

With llama.cpp / llama-server

# Serve with llama-server (recommended):
llama-server -m LUMINIUM-ULTIMATE-CUBE-Q5_K_M.gguf \
  --host 0.0.0.0 --port 8877 \
  -c 4096 -ngl 99

# Or run directly:
llama-cli -m LUMINIUM-ULTIMATE-CUBE-Q5_K_M.gguf \
  -p "What is the capital of France?" \
  -n 256
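
Once llama-server is running with the command above, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal standard-library sketch, assuming the server is reachable on localhost at port 8877 as configured above; the helper names `build_chat_request` and `chat` are hypothetical, and the sampling values are placeholders rather than recommended settings.

```python
import json
import urllib.request

# Matches the --port 8877 flag used above; adjust the host if serving remotely.
BASE_URL = "http://localhost:8877/v1"


def build_chat_request(prompt, max_tokens=256, temperature=0.7):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
#   print(chat("What is the capital of France?"))
```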

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the weights in bfloat16 and place them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "mambiux/Luminium-Gixel-Cube-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "mambiux/Luminium-Gixel-Cube-v1",
    trust_remote_code=True
)

# Format the conversation with the model's chat template before tokenizing.
messages = [{"role": "user", "content": "Write a Python function to reverse a string."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is needed for temperature and top_p to take effect.
with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9
    )
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Note: Requires trust_remote_code=True since LFM2 is a custom architecture not yet in upstream Transformers.

Technical Background

What is Layer Surgery?

Standard fine-tuning can only adjust existing weights. Layer surgery physically adds new layers:

  1. Duplicate a layer from the original LFM2-350M
  2. Heal it with the corresponding layer from an abliterated (safety-uncensored) variant using DARE+TIES merging
  3. Reassign layer indices so the cache system (which distinguishes conv vs attention by index) works correctly

This creates genuine hybrid weight spaces. The result: 16 to 20 layers, 354.5M to 425.8M parameters.
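
The index bookkeeping in step 3 can be illustrated without touching any weights. In the sketch below, the original layout and the insertion positions are read off the Layer Map (the starred layers at L9, L15, L17, L19); everything else, including the dict-based layer records and the `expand` function, is a hypothetical stand-in for the real surgery code, which also copies and merges tensors.

```python
# Original 16-layer layout: the Layer Map with the starred layers removed.
ORIGINAL_TYPES = [
    "conv", "conv", "attn", "conv", "conv", "attn", "conv", "conv",
    "attn", "conv", "attn", "conv", "attn", "conv", "attn", "conv",
]

# (final position, layer type) for each surgical layer, from the Layer Map.
INSERTIONS = [(9, "attn"), (15, "conv"), (17, "attn"), (19, "conv")]


def expand(original_types, insertions):
    """Insert new layers at their final positions, then renumber every layer."""
    layers = [{"type": t, "surgical": False} for t in original_types]
    # Ascending order keeps each target position valid as the list grows.
    for position, layer_type in sorted(insertions):
        layers.insert(position, {"type": layer_type, "surgical": True})
    # Reassign layer_idx so index-based cache logic sees a contiguous 0..N-1.
    for idx, layer in enumerate(layers):
        layer["layer_idx"] = idx
    return layers


layers = expand(ORIGINAL_TYPES, INSERTIONS)
attention_indices = [l["layer_idx"] for l in layers if l["type"] == "attn"]
print(len(layers), attention_indices)
```

After expansion, the list of attention indices is what an index-based cache would need in order to tell convolutional layers from attention layers.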

What is DARE+TIES?

DARE (Drop And REscale): Randomly drops a fraction of weight deltas, then rescales survivors to preserve expected magnitude.
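
A minimal pure-Python sketch of the drop-and-rescale step follows. It operates on a flat list of weight deltas (fine-tuned minus base) rather than real tensors, and the function name `dare` and the drop rate used in the example are illustrative choices, not values taken from this model's merge.

```python
import random


def dare(delta, drop_rate, rng):
    """Randomly zero a fraction of deltas and rescale the survivors.

    delta:     list of floats (fine-tuned weight minus base weight)
    drop_rate: probability of dropping each entry, in [0, 1)
    rng:       random.Random instance, passed in for reproducibility
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    # Rescaling by 1 / (1 - p) keeps the expected value of each entry unchanged.
    scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * scale for d in delta]


rng = random.Random(0)
sparse_delta = dare([1.0] * 10_000, drop_rate=0.5, rng=rng)
mean_after = sum(sparse_delta) / len(sparse_delta)
print(f"mean after DARE: {mean_after:.3f}")
```

Roughly half of the entries become zero and the rest double, so the mean stays close to the original value of 1.0.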

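TIES (Trim, Elect Sign, Merge): Trims small-magnitude changes, resolves sign conflicts by electing, for each parameter, the sign with the larger total magnitude across models, then merges only the values that agree with the elected sign.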
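
The three TIES steps can be sketched on flat lists of deltas as follows. This is a simplified illustration: the function name `ties_merge` and the `density` parameter are introduced here, the sign is elected by summed magnitude as in the original TIES formulation, and ties at the trim threshold may keep slightly more than the requested fraction.

```python
def ties_merge(deltas, density):
    """Merge several delta vectors with trim, sign election, and disjoint mean.

    deltas:  list of equal-length lists of floats (one per model)
    density: fraction of entries to keep per vector, in (0, 1]
    """
    # 1. Trim: keep only the largest-magnitude entries of each vector.
    trimmed = []
    for delta in deltas:
        keep = max(1, round(len(delta) * density))
        threshold = sorted((abs(d) for d in delta), reverse=True)[keep - 1]
        trimmed.append([d if abs(d) >= threshold else 0.0 for d in delta])

    merged = []
    for values in zip(*trimmed):
        # 2. Elect sign: the sign of the sum is the sign carrying more total magnitude.
        sign = 1.0 if sum(values) >= 0 else -1.0
        # 3. Merge: average only the nonzero values that agree with the elected sign.
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


merged = ties_merge(
    [[1.0, -2.0, 0.1, 3.0],
     [2.0, 1.0, -0.2, -1.0]],
    density=0.5,
)
print(merged)
```

In the example, the second and fourth positions have conflicting signs; in each case the larger-magnitude side wins and the disagreeing value is discarded instead of being averaged in.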

Combined, they allow merging two models that would produce garbage with linear interpolation.

Critical Lessons Learned

  1. Never stack steering vectors — applying multiple steering passes causes cumulative norm distortion
  2. SLERP, not linear interpolation — LoRA weight changes are diffuse (rank-1 ratio 0.04-0.12); linear blending leaves the weight manifold
  3. Balance identity and general data — 100% identity data causes mode collapse; ~1:1 ratio works (per ICLR 2025 findings)
  4. Only modify operator_norm — weight matrix perturbations degrade the model; operator_norm is the safe steering target
  5. DynamicPadCollator — pad to longest-in-batch, not max_length; saves 90%+ compute when average sequence is 95 tokens
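
Lesson 5 can be illustrated with a small collator that pads each batch to its own longest sequence. This is a sketch of the idea only, not the `DynamicPadCollator` used in training: the function names, the `pad_id=0` default, and the example lengths are assumptions, and a real collator would take the pad token id from the tokenizer and return tensors.

```python
def dynamic_pad_collate(batch, pad_id=0):
    """Pad a batch of token-id lists to the longest sequence in that batch."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # Mask is 1 for real tokens and 0 for padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def padded_token_savings(lengths, max_length):
    """Fraction of token positions saved versus padding every sequence to max_length."""
    dynamic_positions = max(lengths) * len(lengths)
    fixed_positions = max_length * len(lengths)
    return 1.0 - dynamic_positions / fixed_positions


example = dynamic_pad_collate([[5, 6, 7], [8]])
savings = padded_token_savings([80, 95, 110, 95], max_length=768)
print(example["input_ids"], f"{savings:.1%} fewer token positions")
```

The saving depends on the longest sequence in each batch rather than the average, so batches of similar-length sequences benefit the most.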

Lineage

LiquidAI/LFM2-350M (354.5M, 16 layers)
  |
  +-- Layer Surgery (DARE+TIES with abliterated variant)
  |   +-- LFM2.5-350M-LUMINIUM (425.8M, 20 layers)
  |
  +-- Cognitive Cube Steering (8 cluster models)
  |   +-- Gixel-Cube v5 (geometric cognitive structure)
  |
  +-- LoRA Fine-tuning (18,593 records, 45 sources, 3 epochs)
      +-- LUMINIUM ULTIMATE CUBE  <-- this model

Limitations

  • Does not consistently self-identify — identity signal present but not dominant
  • Routing classification is basic (classifies complexity, does not generate full plan recipes)
  • Occasionally verbose on simple questions
  • Logic on edge cases (affirming the consequent) can be inconsistent in quantized version
  • Requires trust_remote_code=True for Transformers (LFM2 is a custom architecture)
  • Convolutional layers are not supported by all inference backends

Citation

@misc{luminium2026,
  title={LUMINIUM ULTIMATE CUBE: Cognitive Cube Steering and Collective
         Consciousness Distillation for Small Language Models},
  author={mbx and Claude Opus 4.6},
  year={2026},
  note={Built on LiquidAI/LFM2-350M with layer surgery, 8-model
        collective distillation, and geometric cognitive steering}
}

Support This Work

This model was built by one person on consumer hardware — a single AMD Radeon VII running 24/7, a 13-node cluster of second-hand GPUs held together with Tailscale and determination, and more electricity bills than I'd like to admit. No corporate backing, no grants, no cloud credits. Just curiosity and stubbornness.

If LUMINIUM is useful to you — whether for research, production, or just because a 425M model doing what it does makes you smile — consider throwing some sats my way. Every bit helps keep the cluster alive and the experiments running. I'd love to keep pushing what small models can do and sharing it all with the community.

Bitcoin: bc1q3vw8c6h3mxkaes66c6qq5n4mlesuqftev95gklclky6k2hk99pfqfth2ep

License

Apache 2.0 (same as the base LFM2-350M model).
