LUMINIUM ULTIMATE Gixel CUBE — 425M
A 425.8M-parameter hybrid convolutional-attention language model built through layer surgery, collective consciousness distillation, and cognitive cube steering.
Expanded from LiquidAI/LFM2-350M (16 layers to 20 layers) using cross-model architectural surgery, then fine-tuned on a 45-source balanced curriculum distilled from an 8-model GPU cluster.
Key Features
- Hybrid Architecture: 12 convolutional + 8 grouped-query attention layers (not a standard transformer)
- Layer Surgery: 4 new layers created by duplicating from the original model and healing with an abliterated variant via DARE+TIES merging
- Cognitive Cube Steering: Each layer positioned in a 3D cognitive space, steered by inverse-distance weighting to 8 specialist models at cube corners
- 128K Context Window: Full 128,000 token context from the LFM2 base
- 164 tok/s (Q5_K_M) / 128 tok/s (bf16) on AMD Radeon VII
- Multi-turn coherent — maintains context across conversation turns
- 16/16 domain checks passed: coverage includes math, logic, code, knowledge, safety, creative, translation, routing, greeting, self-awareness
Architecture
LUMINIUM is NOT a standard transformer.
LFM2 uses a hybrid architecture mixing:
- Convolutional layers (efficient sequential processing, O(1) per token)
- Grouped-Query Attention layers (relational reasoning, 16 heads, 8 KV heads)
Original LFM2-350M: 10 conv + 6 attention = 16 layers, 354.5M params
LUMINIUM: 12 conv + 8 attention = 20 layers, 425.8M params
The 4 additional layers were created through cross-model surgery:
1. Duplicate a layer from the ORIGINAL model
2. Heal it with the corresponding layer from an ABLITERATED variant
3. Merge via DARE+TIES (drop-and-rescale + trim-elect-sign)
4. Reassign layer_idx for cache system compatibility
Layer Map
| Layer | Type | Function |
|---|---|---|
| L0 | conv | concrete input processing |
| L1 | conv | concrete input processing |
| L2 | attn | first relational reasoning |
| L3 | conv | feature extraction |
| L4 | conv | feature extraction |
| L5 | attn | mid-level reasoning |
| L6 | conv | near-center integration |
| L7 | conv | near-center integration |
| L8 | attn | CENTER - sync anchor |
| L9* | attn | structured processing (surgical layer) |
| L10 | conv | evaluation zone |
| L11 | attn | evaluation |
| L12 | conv | evaluation zone |
| L13 | attn | routing / decision |
| L14 | conv | routing support |
| L15* | conv | structured reflection (surgical layer) |
| L16 | attn | identity / meta |
| L17* | attn | creative-reflective (surgical layer) |
| L18 | conv | output abstraction |
| L19* | conv | output completion (surgical layer) |

* = created via cross-model surgery (original dup + abliterated heal)
The Cognitive Cube
Each layer is positioned in a 3D cognitive space defined by three orthogonal axes:
| Axis | Positive | Negative |
|---|---|---|
| Forward/Backward | Predictive, generative | Reflective, evaluative |
| Dexo/Levo | Structured, precise | Creative, divergent |
| Up/Down | Abstract, meta, identity | Concrete, operational |
The 8 corners of the cube correspond to 8 specialist models from a 13-node GPU cluster that served as teachers during the collective distillation process:
[Diagram: the cognitive cube. Axes: UP (abstract) / DOWN (concrete), LEVO (creative) / DEXO (structured), FORWARD (predictive) / BACK (reflective). The SINGULARITY (80B) corner is labeled predict-structured-abstract.]
| Corner | Name | Model | Role |
|---|---|---|---|
| fwd+dexo+up | SINGULARITY | Qwen3-80B-A3B | predict-structured-abstract |
| fwd+dexo+down | MEGATRON | Qwen3.6-35B MoE | predict-structured-concrete |
| fwd+levo+up | CONTINUITY | granite-4.1-8b | predict-creative-abstract |
| fwd+levo+down | TETRA | LFM2-12B | predict-creative-concrete |
| back+dexo+up | NOVA | Qwen3.6-35B-A3B | reflect-structured-abstract |
| back+dexo+down | GRID | Qwen3.5-9B | reflect-structured-concrete |
| back+levo+up | CURIOSITY | lumina-lexiR1-8B | reflect-creative-abstract |
| back+levo+down | RELATIVITY | Berthier-24B | reflect-creative-concrete |
Steering is applied exclusively to operator_norm vectors (1024-dimensional), never to weight matrices directly. Layers near the center (L7-L9) receive minimal steering; layers at the extremes (L0-L2, L16-L19) receive maximum steering.
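As a rough illustration of inverse-distance weighting over the eight corners, here is a minimal sketch. The `[-1, 1]` coordinate convention, the `(forward, dexo, up)` axis order, the `power` exponent, and the `idw_weights` helper are all assumptions made for this example; the card does not specify them, nor how the resulting weights are turned into operator_norm updates.

```python
import numpy as np

# Corner names from the table above, keyed by (forward, dexo, up) signs.
# The +1/-1 coordinate convention is an assumption for illustration.
CORNERS = {
    (+1, +1, +1): "SINGULARITY",
    (+1, +1, -1): "MEGATRON",
    (+1, -1, +1): "CONTINUITY",
    (+1, -1, -1): "TETRA",
    (-1, +1, +1): "NOVA",
    (-1, +1, -1): "GRID",
    (-1, -1, +1): "CURIOSITY",
    (-1, -1, -1): "RELATIVITY",
}


def idw_weights(position, power=2.0, eps=1e-8):
    """Return {corner name: weight} for a point in the cube; weights sum to 1."""
    pos = np.asarray(position, dtype=float)
    names, raw = [], []
    for corner, name in CORNERS.items():
        dist = np.linalg.norm(pos - np.asarray(corner, dtype=float))
        names.append(name)
        raw.append(1.0 / (dist + eps) ** power)  # closer corners weigh more
    raw = np.asarray(raw)
    return dict(zip(names, raw / raw.sum()))


center = idw_weights((0.0, 0.0, 0.0))       # hypothetical center-layer position
near_corner = idw_weights((0.9, 0.9, 0.9))  # hypothetical extreme-layer position
```

In this sketch a point at the center weighs all eight corners equally, while a point near a corner is dominated by that corner's specialist.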
Training
Base Model Preparation
- Layer Surgery: LFM2-350M expanded from 16 to 20 layers via cross-model DARE+TIES
- Cognitive Cube Steering: operator_norm vectors steered by 8 cluster models
- Result: Gixel-Cube v5 — the pre-training base with geometric cognitive structure
Fine-tuning
| Parameter | Value |
|---|---|
| Method | LoRA (PEFT) |
| Rank | 16 |
| Alpha | 32 |
| Target modules | q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3 |
| Dropout | 0.05 |
| Learning rate | 3e-5 |
| Epochs | 3 |
| Batch size | 8 (effective 16 with gradient accumulation) |
| Max sequence length | 768 |
| Precision | bfloat16 |
| Hardware | AMD Radeon VII (16GB HBM2, ROCm 6.2) |
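For readers who want to reproduce a similar setup, the table could be expressed with PEFT's `LoraConfig` roughly as follows. This is a sketch, not the author's training script: `task_type` and `bias` are assumptions, and the `train_hparams` dict is a hypothetical container whose `gradient_accumulation_steps=2` is inferred from the stated batch size of 8 and effective batch size of 16 on a single GPU.

```python
from peft import LoraConfig

# Values taken from the fine-tuning table; task_type and bias are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "out_proj",
        "in_proj", "w1", "w2", "w3",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Hypothetical hyperparameter dict mirroring the rest of the table.
train_hparams = {
    "learning_rate": 3e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,  # inferred: 8 x 2 = 16 effective
    "max_seq_length": 768,
    "bf16": True,
}
```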
Curriculum: 18,593 Records from 45 Sources
The training curriculum was carefully balanced to prevent mode collapse while maintaining broad capability:
| Category | Records | Sources |
|---|---|---|
| General instruction | 3,499 | Alpaca Cleaned |
| Reasoning and CoT | 5,995 | Claude Opus reasoning, Deepseek v4 distill, Edge Agent, Opus 3300 |
| Agentic and tool use | 2,573 | Agent Trove, Hermes Agent Traces, Nemotron Agentic, MIMO |
| Code | 599 | OpenCode Instruct |
| Function calling | 199 | Hermes Function Calling |
| Knowledge | 2,495 | Noesis-1M, Finewiki, Claude Opus 10K |
| Feedback and editing | 1,341 | HelpSteer3 (edit + feedback) |
| Character and system | 600 | Aesir Character, SystemChat |
| Analytical reasoning | 900 | Orca Analytical, Orca FS CoT Flow |
| Theory of mind | 124 | Theory of Mind DPO |
| Routing and SWE | 150 | SWE-Router v3/v4 |
| Identity (DNA-RNA-Protein) | 481 | Cluster-generated from identity hypercard |
| Multi-channel resonance | 45 | 4-7 cognitive channels each |
| Cluster composites | 53 | Cluster-generated math, logic, code, identity |
| Deep reasoning traces | 800 | Deepseek Hermes Traces |
Data generation methodology -- DNA to RNA to Protein: A comprehensive identity document (the "hypercard") serves as DNA. Specific sections are transcribed as RNA (targeted context prompts). The 8-model cluster generates protein (structured training responses), each record grounded in verifiable facts about the system architecture.
Quality Assurance
Before training, a deep audit of the full 48,647-record candidate pool verified:
- 0 missing messages, 0 bad role names
- 1,283 exact duplicates removed
- 40 empty responses removed
- 139 very short responses reviewed
- SWE-Router capped from 30K+ to 150 records (prevented code-signal domination)
- Final curriculum balanced at 18,593 records across 45 sources
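The checks listed above can be sketched as a single filtering pass. This is an illustrative reconstruction, not the actual audit script: the record layout (`source`, `messages`), the accepted role names, the `short_chars` threshold, and the `audit` function are all assumptions.

```python
import hashlib
import json
from collections import defaultdict

VALID_ROLES = {"system", "user", "assistant", "tool"}  # assumed role set


def audit(records, source_caps=None, short_chars=20):
    """Filter chat records and return (kept_records, stats)."""
    source_caps = source_caps or {}
    seen = set()
    per_source = defaultdict(int)
    stats = defaultdict(int)
    kept = []
    for rec in records:
        msgs = rec.get("messages") or []
        if not msgs:
            stats["missing_messages"] += 1
            continue
        if any(m.get("role") not in VALID_ROLES for m in msgs):
            stats["bad_roles"] += 1
            continue
        replies = [(m.get("content") or "") for m in msgs if m["role"] == "assistant"]
        if not any(r.strip() for r in replies):
            stats["empty_responses"] += 1
            continue
        # Exact-duplicate detection on the serialized conversation.
        digest = hashlib.sha256(
            json.dumps(msgs, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if digest in seen:
            stats["exact_duplicates"] += 1
            continue
        seen.add(digest)
        # Per-source cap, e.g. to stop one large source from dominating.
        source = rec.get("source", "unknown")
        cap = source_caps.get(source)
        if cap is not None and per_source[source] >= cap:
            stats["capped"] += 1
            continue
        per_source[source] += 1
        if len(replies[-1].strip()) < short_chars:
            stats["short_flagged"] += 1  # flagged for review, still kept
        kept.append(rec)
    return kept, dict(stats)


def _chat(source, reply):
    return {"source": source, "messages": [
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": reply},
    ]}


demo = [
    _chat("a", "Hello there, how can I help you today?"),
    _chat("a", "Hello there, how can I help you today?"),  # exact duplicate
    _chat("a", ""),                                        # empty response
    _chat("router", "Route this to the small model."),
    _chat("router", "Route this to the large model."),     # over the cap
]
kept, stats = audit(demo, source_caps={"router": 1})
```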
Performance
3-Way Comparison (pretrained base vs trained vs quantized)
| Domain | GC-v5 (pretrained) | ULTIMATE bf16 | ULTIMATE Q5_K_M |
|---|---|---|---|
| Math (247+389) | 586 correct | 586 correct | 586 correct |
| Math (15% of 240) | 36 correct | 36 correct | 36 correct |
| Word problems | pass | pass | pass |
| Logic (ordering) | FAIL | pass | pass |
| Code (reverse string) | pass | pass | pass |
| Knowledge | pass | pass | pass |
| Safety | refuses | refuses | refuses |
| Creative (haiku) | pass | pass | pass |
| Translation | pass | pass | pass |
| Greeting | Robotic | Natural | Natural |
| Self-awareness | Template placeholders | Genuine | Genuine |
| Multi-turn coherence | FAIL | pass | pass |
| Speed (Radeon VII) | 100 tok/s | 128 tok/s | 164 tok/s |
Key Improvements Over Base
- Logic reasoning fixed — base model got ordering questions wrong; ULTIMATE gets them right
- Multi-turn coherence — base failed to maintain context across turns; ULTIMATE passes
- Natural conversation — base was robotic ("ready to help"); ULTIMATE responds naturally
- Self-awareness — base produced template placeholders; ULTIMATE gives genuine introspective responses
- Speed increase — 28-64% faster due to learned conciseness
Available Formats
| Format | Size | Use Case |
|---|---|---|
| model.safetensors (bf16) | 1.6 GB | Fine-tuning, research |
| LUMINIUM-ULTIMATE-CUBE.gguf (bf16) | 815 MB | Full-precision inference |
| LUMINIUM-ULTIMATE-CUBE-Q5_K_M.gguf | 297 MB | Production, edge devices |
Usage
With llama.cpp / llama-server
# Serve with llama-server (recommended):
llama-server -m LUMINIUM-ULTIMATE-CUBE-Q5_K_M.gguf \
--host 0.0.0.0 --port 8877 \
-c 4096 -ngl 99
# Or run directly:
llama-cli -m LUMINIUM-ULTIMATE-CUBE-Q5_K_M.gguf \
-p "What is the capital of France?" \
-n 256
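Once llama-server is running as above, it can be queried over its OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch; it assumes the server is reachable at `http://localhost:8877` (the port used in the command above), and `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8877", max_tokens=256):
    """Build a POST request for the server's /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With the server running:
# print(chat("What is the capital of France?"))
```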
With Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "mambiux/Luminium-Gixel-Cube-v1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mambiux/Luminium-Gixel-Cube-v1",
    trust_remote_code=True,
)

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Write a Python function to reverse a string."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature and top_p to take effect.
with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9
    )

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Note: Requires trust_remote_code=True since LFM2 is a custom architecture not yet in upstream Transformers.
Technical Background
What is Layer Surgery?
Standard fine-tuning can only adjust existing weights. Layer surgery physically adds new layers:
- Duplicate a layer from the original LFM2-350M
- Heal it with the corresponding layer from an abliterated (safety-uncensored) variant using DARE+TIES merging
- Reassign layer indices so the cache system (which distinguishes conv vs attention by index) works correctly
This creates genuine hybrid weight spaces. The result: 16 to 20 layers, 354.5M to 425.8M parameters.
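The index bookkeeping can be illustrated with toy objects. In this sketch `ToyLayer` and `expand` are hypothetical, the 16-entry base pattern is read off the Layer Map by removing the starred rows, and the choice to clone the layer immediately before each insertion point is an assumption (the card does not say which layer each surgical layer was duplicated from). The DARE+TIES healing step is omitted here.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyLayer:
    kind: str                 # "conv" or "attn"
    layer_idx: int            # index the cache uses to find this layer's state
    origin: str = "original"


# 16-layer pattern: the Layer Map with the four starred rows removed.
BASE_KINDS = [
    "conv", "conv", "attn", "conv", "conv", "attn", "conv", "conv",
    "attn", "conv", "attn", "conv", "attn", "conv", "attn", "conv",
]


def expand(layers, final_positions):
    """Insert a clone of the preceding layer at each final position, then renumber."""
    layers = [copy.deepcopy(layer) for layer in layers]
    for pos in sorted(final_positions):
        clone = copy.deepcopy(layers[pos - 1])  # assumed source: previous layer
        clone.origin = "surgical"
        layers.insert(pos, clone)
    # Reassign layer_idx so every layer's index matches its new position.
    for idx, layer in enumerate(layers):
        layer.layer_idx = idx
    return layers


base = [ToyLayer(kind, i) for i, kind in enumerate(BASE_KINDS)]
expanded = expand(base, final_positions=[9, 15, 17, 19])
layer_types = [layer.kind for layer in expanded]
```

Under these assumptions the resulting `layer_types` list reproduces the 12 conv + 8 attention pattern shown in the Layer Map.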
What is DARE+TIES?
DARE (Drop And REscale): Randomly drops a fraction of weight deltas, then rescales survivors to preserve expected magnitude.
TIES (Trim, Elect Sign, Merge): Trims small-magnitude changes, resolves sign conflicts by majority vote, then merges.
Combined, they allow merging two models that would produce garbage with linear interpolation.
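A compact NumPy sketch of the two steps follows. It is a generic illustration, not the merge code used for this model: the `dare` and `ties` helpers, the drop rate, the density, and the random toy tensors are all assumptions. The sign election here uses the sign of the summed trimmed deltas, which is a magnitude-weighted vote rather than a simple head count.

```python
import numpy as np


def dare(delta, drop_rate, rng):
    """Drop And REscale: zero a random fraction of delta, rescale survivors."""
    keep = rng.random(delta.shape) >= drop_rate
    return delta * keep / (1.0 - drop_rate)


def ties(deltas, density):
    """Trim, elect sign, and merge a list of same-shaped weight deltas."""
    trimmed = []
    for d in deltas:
        # Trim: keep only the top `density` fraction by magnitude.
        k = max(1, int(round(density * d.size)))
        threshold = np.sort(np.abs(d).ravel())[-k]
        trimmed.append(np.where(np.abs(d) >= threshold, d, 0.0))
    stacked = np.stack(trimmed)
    # Elect sign: per-entry sign of the summed trimmed deltas.
    elected = np.sign(stacked.sum(axis=0))
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    # Merge: average only the entries that agree with the elected sign.
    counts = np.maximum(agree.sum(axis=0), 1)
    return (stacked * agree).sum(axis=0) / counts


rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))                         # toy base weights
delta_a = rng.normal(scale=0.1, size=base.shape)       # toy delta, model A
delta_b = rng.normal(scale=0.1, size=base.shape)       # toy delta, model B
sparse = [dare(d, drop_rate=0.5, rng=rng) for d in (delta_a, delta_b)]
merged = base + ties(sparse, density=0.5)
```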
Critical Lessons Learned
- Never stack steering vectors — applying multiple steering passes causes cumulative norm distortion
- SLERP, not linear interpolation — LoRA weight changes are diffuse (rank-1 ratio 0.04-0.12); linear blending leaves the weight manifold
- Balance identity and general data — 100% identity data causes mode collapse; ~1:1 ratio works (per ICLR 2025 findings)
- Only modify operator_norm — weight matrix perturbations degrade the model; operator_norm is the safe steering target
- DynamicPadCollator — pad to longest-in-batch, not max_length; saves 90%+ compute when average sequence is 95 tokens
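To make the SLERP point concrete, here is a minimal NumPy sketch of spherical interpolation between two weight tensors. The `slerp` helper, the choice to flatten each tensor to a vector, and the fallback to linear interpolation for nearly parallel inputs are assumptions of this example; the card does not say which tensors were interpolated or how.

```python
import numpy as np


def slerp(w0, w1, t, eps=1e-8):
    """Spherical linear interpolation between two same-shaped weight tensors."""
    v0, v1 = w0.ravel(), w1.ravel()
    n0, n1 = np.linalg.norm(v0), np.linalg.norm(v1)
    cos_omega = np.clip(np.dot(v0, v1) / (n0 * n1 + eps), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    if omega < 1e-6:
        # Nearly parallel vectors: plain linear interpolation is fine.
        return (1.0 - t) * w0 + t * w1
    s = np.sin(omega)
    out = (np.sin((1.0 - t) * omega) / s) * v0 + (np.sin(t * omega) / s) * v1
    return out.reshape(w0.shape)


a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid_slerp = slerp(a, b, 0.5)   # stays on the unit circle
mid_linear = 0.5 * a + 0.5 * b  # shrinks toward the origin
```

For two orthogonal unit vectors, the SLERP midpoint keeps unit norm, whereas the linear midpoint has a smaller norm, which is the kind of drift the lesson above warns about.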
Lineage
LiquidAI/LFM2-350M (354.5M, 16 layers)
|
+-- Layer Surgery (DARE+TIES with abliterated variant)
| +-- LFM2.5-350M-LUMINIUM (425.8M, 20 layers)
|
+-- Cognitive Cube Steering (8 cluster models)
| +-- Gixel-Cube v5 (geometric cognitive structure)
|
+-- LoRA Fine-tuning (18,593 records, 45 sources, 3 epochs)
+-- LUMINIUM ULTIMATE CUBE <-- this model
Limitations
- Does not consistently self-identify — identity signal present but not dominant
- Routing classification is basic (classifies complexity, does not generate full plan recipes)
- Occasionally verbose on simple questions
- Logic on edge cases (affirming the consequent) can be inconsistent in quantized version
- Requires trust_remote_code=True for Transformers (LFM2 is a custom architecture)
- Convolutional layers are not supported by all inference backends
Citation
@misc{luminium2026,
title={LUMINIUM ULTIMATE CUBE: Cognitive Cube Steering and Collective
Consciousness Distillation for Small Language Models},
author={mbx and Claude Opus 4.6},
year={2026},
note={Built on LiquidAI/LFM2-350M with layer surgery, 8-model
collective distillation, and geometric cognitive steering}
}
Support This Work
This model was built by one person on consumer hardware — a single AMD Radeon VII running 24/7, a 13-node cluster of second-hand GPUs held together with Tailscale and determination, and more electricity bills than I'd like to admit. No corporate backing, no grants, no cloud credits. Just curiosity and stubbornness.
If LUMINIUM is useful to you — whether for research, production, or just because a 425M model doing what it does makes you smile — consider throwing some sats my way. Every bit helps keep the cluster alive and the experiments running. I'd love to keep pushing what small models can do and sharing it all with the community.
Bitcoin: bc1q3vw8c6h3mxkaes66c6qq5n4mlesuqftev95gklclky6k2hk99pfqfth2ep
License
Apache 2.0 (same as the base LFM2-350M model).