LUMINIUM

LUMINIUM ULTIMATE Gixel CUBE — 425M

A 425.8M-parameter hybrid convolutional-attention language model built through layer surgery, collective consciousness distillation, and cognitive cube steering.

Expanded from LiquidAI/LFM2-350M (16 layers to 20 layers) using cross-model architectural surgery, then fine-tuned on a 45-source balanced curriculum distilled from an 8-model GPU cluster.

Key Features

Hybrid Architecture: 12 convolutional + 8 grouped-query attention layers (not a standard transformer)
Layer Surgery: 4 new layers created by duplicating from the original model and healing with an abliterated variant via DARE+TIES merging
Cognitive Cube Steering: Each layer positioned in a 3D cognitive space, steered by inverse-distance weighting to 8 specialist models at cube corners
128K Context Window: Full 128,000 token context from the LFM2 base
Throughput: 164 tok/s (Q5_K_M) / 128 tok/s (bf16) on AMD Radeon VII
Multi-turn coherent: maintains context across conversation turns
16/16 domain pass: math, logic, code, knowledge, safety, creative, translation, routing, greeting, self-awareness

Architecture

LUMINIUM is NOT a standard transformer.

LFM2 uses a hybrid architecture mixing:
  - Convolutional layers (efficient sequential processing, O(1) per token)
  - Grouped-Query Attention layers (relational reasoning, 16 heads, 8 KV heads)

Original LFM2-350M: 10 conv + 6 attention = 16 layers, 354.5M params
LUMINIUM:           12 conv + 8 attention = 20 layers, 425.8M params

The 4 additional layers were created through cross-model surgery:
  1. Duplicate a layer from the ORIGINAL model
  2. Heal it with the corresponding layer from an ABLITERATED variant
  3. Merge via DARE+TIES (drop-and-rescale + trim-elect-sign)
  4. Reassign layer_idx for cache system compatibility

Layer Map

Layer  Type   Function
-----  -----  -----------------------
L0     conv   concrete input processing
L1     conv   concrete input processing
L2     attn   first relational reasoning
L3     conv   feature extraction
L4     conv   feature extraction
L5     attn   mid-level reasoning
L6     conv   near-center integration
L7     conv   near-center integration
L8     attn   CENTER - sync anchor
L9*    attn   structured processing        <- surgical layer
L10    conv   evaluation zone
L11    attn   evaluation
L12    conv   evaluation zone
L13    attn   routing / decision
L14    conv   routing support
L15*   conv   structured reflection         <- surgical layer
L16    attn   identity / meta
L17*   attn   creative-reflective           <- surgical layer
L18    conv   output abstraction
L19*   conv   output completion             <- surgical layer

* = created via cross-model surgery (original dup + abliterated heal)
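
As a rough sketch, the layer map above can be rebuilt from the 16-layer stack (the map minus the starred rows) by inserting the four surgical layers at their final positions and renumbering layer_idx. The names `ORIGINAL_STACK`, `SURGICAL`, and `expand_stack` are illustrative only and are not taken from the actual surgery scripts.

```python
from collections import Counter

# 16-layer stack implied by the layer map above (starred rows removed).
ORIGINAL_STACK = [
    "conv", "conv", "attn", "conv", "conv", "attn", "conv", "conv",
    "attn", "conv", "attn", "conv", "attn", "conv", "attn", "conv",
]

# Final position -> type of each surgical layer, read off the layer map.
SURGICAL = {9: "attn", 15: "conv", 17: "attn", 19: "conv"}


def expand_stack(original, surgical):
    """Insert surgical layers and assign a contiguous layer_idx to every layer."""
    remaining = iter(original)
    total = len(original) + len(surgical)
    layers = []
    for idx in range(total):
        if idx in surgical:
            layers.append({"layer_idx": idx, "type": surgical[idx], "surgical": True})
        else:
            layers.append({"layer_idx": idx, "type": next(remaining), "surgical": False})
    return layers


layers = expand_stack(ORIGINAL_STACK, SURGICAL)
counts = Counter(layer["type"] for layer in layers)
print(len(layers), dict(counts))
```

The counts recovered this way should match the 12 conv + 8 attention split stated in the Architecture section.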

The Cognitive Cube

Each layer is positioned in a 3D cognitive space defined by three orthogonal axes:

Axis	Positive	Negative
Forward/Backward	Predictive, generative	Reflective, evaluative
Dexo/Levo	Structured, precise	Creative, divergent
Up/Down	Abstract, meta, identity	Concrete, operational

The 8 corners of the cube correspond to 8 specialist models from a 13-node GPU cluster that served as teachers during the collective distillation process:

         UP (abstract)
         |
         |    +================+
         |   /                /|
         |  /  SINGULARITY   / |  predict-structured-abstract
         | /     (80B)      /  |
         |+================+   |
         ||                |   /-- BACK (reflective)
         ||   COGNITIVE    |  /
LEVO ----||     CUBE       | /---- DEXO (structured)
(creative)|+================+/
         |                 /
         DOWN (concrete)  /
         FORWARD (predictive)

Corner	Node	Underlying model	Role
fwd+dexo+up	SINGULARITY	Qwen3-80B-A3B	predict-structured-abstract
fwd+dexo+down	MEGATRON	Qwen3.6-35B MoE	predict-structured-concrete
fwd+levo+up	CONTINUITY	granite-4.1-8b	predict-creative-abstract
fwd+levo+down	TETRA	LFM2-12B	predict-creative-concrete
back+dexo+up	NOVA	Qwen3.6-35B-A3B	reflect-structured-abstract
back+dexo+down	GRID	Qwen3.5-9B	reflect-structured-concrete
back+levo+up	CURIOSITY	lumina-lexiR1-8B	reflect-creative-abstract
back+levo+down	RELATIVITY	Berthier-24B	reflect-creative-concrete

Steering is applied exclusively to operator_norm vectors (1024-dimensional), never to weight matrices directly. Layers near the center (L7-L9) receive minimal steering; layers at the extremes (L0-L2, L16-L19) receive maximum steering.
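
A minimal sketch of how inverse-distance weighting over the cube corners could work, under stated assumptions: corners sit at (±1, ±1, ±1), the weighting exponent is 2, and steering strength grows linearly with distance from the cube center. The function names, `max_strength`, and the per-corner steering vectors are all hypothetical; the real operator_norm vectors are 1024-dimensional, and the actual steering code is not shown in this card.

```python
import math
from itertools import product

# Cube corners as (forward/back, dexo/levo, up/down) signs; +1 is the "positive" pole.
CORNERS = list(product((1.0, -1.0), repeat=3))


def idw_weights(position, power=2.0, eps=1e-9):
    """Normalized inverse-distance weights of one layer position to the 8 corners."""
    raw = [1.0 / (math.dist(position, corner) ** power + eps) for corner in CORNERS]
    total = sum(raw)
    return [w / total for w in raw]


def steer(operator_norm, corner_vectors, position, max_strength=0.05):
    """Blend corner vectors by IDW and add the result to an operator_norm vector.

    Strength scales with distance from the cube center, so central layers
    barely move and extreme layers move the most (assumed schedule).
    """
    weights = idw_weights(position)
    strength = max_strength * math.dist(position, (0.0, 0.0, 0.0)) / math.sqrt(3.0)
    return [
        value + strength * sum(w * vec[i] for w, vec in zip(weights, corner_vectors))
        for i, value in enumerate(operator_norm)
    ]
```

A layer at the exact center gets equal weight from all eight corners and zero strength, which is one way to realize the "minimal steering near L7-L9" behavior described above. Only the norm vector is touched, never a weight matrix.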

Training

Base Model Preparation

Layer Surgery: LFM2-350M expanded from 16 to 20 layers via cross-model DARE+TIES
Cognitive Cube Steering: operator_norm vectors steered by 8 cluster models
Result: Gixel-Cube v5, the base checkpoint (before fine-tuning) with geometric cognitive structure

Fine-tuning

Parameter	Value
Method	LoRA (PEFT)
Rank	16
Alpha	32
Target modules	q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3
Dropout	0.05
Learning rate	3e-5
Epochs	3
Batch size	8 (effective 16 with gradient accumulation)
Max sequence length	768
Precision	bfloat16
Hardware	AMD Radeon VII (16GB HBM2, ROCm 6.2)
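
For orientation, the table implies a rough optimizer-step budget. This is only an estimate: it assumes 2 gradient-accumulation steps (inferred from batch size 8 giving an effective 16) and that the last partial batch of each epoch is kept. Training frameworks may round slightly differently.

```python
import math

RECORDS = 18_593          # curriculum size stated in this card
PER_DEVICE_BATCH = 8      # from the table above
GRAD_ACCUM_STEPS = 2      # assumption: 8 x 2 = effective batch of 16
EPOCHS = 3
LORA_RANK, LORA_ALPHA = 16, 32

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
steps_per_epoch = math.ceil(RECORDS / effective_batch)
total_steps = steps_per_epoch * EPOCHS
lora_scale = LORA_ALPHA / LORA_RANK  # standard LoRA scaling factor alpha / r

print(effective_batch, steps_per_epoch, total_steps, lora_scale)
```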

Curriculum: 18,593 Records from 45 Sources

The training curriculum was carefully balanced to prevent mode collapse while maintaining broad capability:

Category	Records	Sources
General instruction	3,499	Alpaca Cleaned
Reasoning and CoT	5,995	Claude Opus reasoning, Deepseek v4 distill, Edge Agent, Opus 3300
Agentic and tool use	2,573	Agent Trove, Hermes Agent Traces, Nemotron Agentic, MIMO
Code	599	OpenCode Instruct
Function calling	199	Hermes Function Calling
Knowledge	2,495	Noesis-1M, Finewiki, Claude Opus 10K
Feedback and editing	1,341	HelpSteer3 (edit + feedback)
Character and system	600	Aesir Character, SystemChat
Analytical reasoning	900	Orca Analytical, Orca FS CoT Flow
Theory of mind	124	Theory of Mind DPO
Routing and SWE	150	SWE-Router v3/v4
Identity (DNA-RNA-Protein)	481	Cluster-generated from identity hypercard
Multi-channel resonance	45	4-7 cognitive channels each
Cluster composites	53	Cluster-generated math, logic, code, identity
Deep reasoning traces	800	Deepseek Hermes Traces

Data generation methodology (DNA to RNA to Protein): A comprehensive identity document (the "hypercard") serves as the DNA. Selected sections are transcribed into RNA (targeted context prompts). The 8-model cluster then generates the protein (structured training responses), with each record grounded in verifiable facts about the system architecture.

Quality Assurance

Before training, a deep audit of the full 48,647-record candidate pool verified:

0 missing messages, 0 bad role names
1,283 exact duplicates removed
40 empty responses removed
139 very short responses reviewed
SWE-Router capped from 30K+ to 150 records (prevented code-signal domination)
Final curriculum balanced at 18,593 records across 45 sources
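
The audit steps above could look roughly like the sketch below. It assumes chat-style records with a `messages` list of `{"role", "content"}` dicts whose last message is the response; the accepted role names, the 20-character "very short" threshold, and the function name `audit` are assumptions for illustration, not the actual audit script.

```python
import json

VALID_ROLES = {"system", "user", "assistant", "tool"}  # assumed role vocabulary


def audit(records, min_response_chars=20):
    """Drop malformed, duplicate, and empty records; flag very short responses."""
    seen = set()
    kept = []
    stats = {"malformed": 0, "duplicates": 0, "empty": 0, "short": 0}
    for record in records:
        messages = record.get("messages") or []
        if not messages or any(m.get("role") not in VALID_ROLES for m in messages):
            stats["malformed"] += 1
            continue
        key = json.dumps(messages, sort_keys=True)  # exact-duplicate fingerprint
        if key in seen:
            stats["duplicates"] += 1
            continue
        seen.add(key)
        reply = (messages[-1].get("content") or "").strip()
        if not reply:
            stats["empty"] += 1
            continue
        if len(reply) < min_response_chars:
            stats["short"] += 1  # kept, but flagged for manual review
        kept.append(record)
    return kept, stats
```

Short responses are counted rather than dropped, mirroring the "reviewed" wording above, since a one-token answer can be correct for arithmetic prompts.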

Performance

3-Way Comparison (pretrained base vs trained vs quantized)

Domain	GC-v5 (pretrained)	ULTIMATE bf16	ULTIMATE Q5_K_M
Math (247+389)	586 correct	586 correct	586 correct
Math (15% of 240)	36 correct	36 correct	36 correct
Word problems	pass	pass	pass
Logic (ordering)	FAIL	pass	pass
Code (reverse string)	pass	pass	pass
Knowledge	pass	pass	pass
Safety	refuses	refuses	refuses
Creative (haiku)	pass	pass	pass
Translation	pass	pass	pass
Greeting	Robotic	Natural	Natural
Self-awareness	Template placeholders	Genuine	Genuine
Multi-turn coherence	FAIL	pass	pass
Speed (Radeon VII)	100 tok/s	128 tok/s	164 tok/s

Key Improvements Over Base

Logic reasoning fixed: the base model got ordering questions wrong; ULTIMATE gets them right
Multi-turn coherence: the base failed to maintain context across turns; ULTIMATE passes
Natural conversation: the base was robotic ("ready to help"); ULTIMATE responds naturally
Self-awareness: the base produced template placeholders; ULTIMATE gives genuine introspective responses
Speed increase: 28% faster in bf16 (128 vs 100 tok/s) and 64% faster at Q5_K_M (164 vs 100 tok/s), attributed to learned conciseness

Available Formats

Format	Size	Use Case
`model.safetensors` (F32)	1.6 GB	Fine-tuning, research
`LUMINIUM-ULTIMATE-CUBE.gguf` (bf16)	815 MB	Full-precision inference
`LUMINIUM-ULTIMATE-CUBE-Q5_K_M.gguf`	297 MB	Production, edge devices
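
A quick sanity check on file sizes: raw weight storage is roughly the parameter count times the bits per weight. The figure of about 5.5 bits per weight for Q5_K_M is an approximation assumed here, and real files also carry metadata and may keep some tensors at higher precision, so treat the numbers as ballpark only.

```python
PARAMS = 425.8e6  # parameter count stated in this card

BITS_PER_WEIGHT = {
    "float32": 32,
    "bfloat16": 16,
    "Q5_K_M": 5.5,  # assumption: approximate average for this quantization
}


def estimate_size_mb(params, bits_per_weight):
    """Raw weight storage in decimal megabytes (10**6 bytes)."""
    return params * bits_per_weight / 8 / 1e6


for name, bits in BITS_PER_WEIGHT.items():
    size_mb = estimate_size_mb(PARAMS, bits)
    size_mib = size_mb * 1e6 / 2**20
    print(f"{name:9s} ~{size_mb:7.1f} MB  (~{size_mib:7.1f} MiB)")
```

The bf16 GGUF (815 MB) and the Q5_K_M GGUF (297 MB) listed above are close to the 16-bit and roughly 5.5-bit estimates.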

Usage

With llama.cpp / llama-server

# Serve with llama-server (recommended):
llama-server -m LUMINIUM-ULTIMATE-CUBE-Q5_K_M.gguf \
  --host 0.0.0.0 --port 8877 \
  -c 4096 -ngl 99

# Or run directly:
llama-cli -m LUMINIUM-ULTIMATE-CUBE-Q5_K_M.gguf \
  -p "What is the capital of France?" \
  -n 256
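
Once llama-server is running as above, it can be queried over its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library; the helper names `build_chat_request` and `ask` are illustrative, and the base URL assumes the server is reachable on localhost at the port 8877 used in the command above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8877/v1"  # matches --port 8877 above


def build_chat_request(prompt, max_tokens=256, temperature=0.7):
    """Build a POST request for llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("What is the capital of France?"))` prints the model's reply. The request is built separately from the network call so it can be inspected without a live server.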

With Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mambiux/Luminium-Gixel-Cube-v1"

# Load the weights in bfloat16 and let device_map place them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Render and tokenize the conversation in one step with the model's chat template.
messages = [{"role": "user", "content": "Write a Python function to reverse a string."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# do_sample=True is needed for temperature and top_p to take effect.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True))

Note: Requires trust_remote_code=True since LFM2 is a custom architecture not yet in upstream Transformers.

Technical Background

What is Layer Surgery?

Standard fine-tuning can only adjust existing weights. Layer surgery physically adds new layers:

Duplicate a layer from the original LFM2-350M
Heal it with the corresponding layer from an abliterated (safety-uncensored) variant using DARE+TIES merging
Reassign layer indices so the cache system (which distinguishes conv vs attention by index) works correctly

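Each new layer therefore blends original and abliterated weights rather than copying either one. The result: the stack grows from 16 to 20 layers and from 354.5M to 425.8M parameters (about 71.3M added).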

What is DARE+TIES?

DARE (Drop And REscale): Randomly drops a fraction of weight deltas, then rescales survivors to preserve expected magnitude.

TIES (Trim, Elect Sign, Merge): Trims small-magnitude changes, resolves sign conflicts by majority vote, then merges.

Combined, they allow merging two models that would produce garbage with linear interpolation.
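
To make the two steps concrete, here is a toy sketch on flat Python lists. It is a generic illustration of DARE followed by TIES, not the merge code used for this model: the function names, the `drop_rate` and `density` defaults, and the choice of reference weights (`base`) are assumptions. Sign election here uses the sign of the summed deltas, which is the usual TIES formulation of the "vote".

```python
import random


def dare(delta, drop_rate, rng):
    """DARE: drop each delta with probability drop_rate, rescale the survivors."""
    keep_scale = 1.0 / (1.0 - drop_rate)
    return [d * keep_scale if rng.random() >= drop_rate else 0.0 for d in delta]


def trim(delta, density):
    """TIES trim: keep only the largest-magnitude fraction of entries."""
    k = max(1, round(len(delta) * density))
    threshold = sorted((abs(d) for d in delta), reverse=True)[k - 1]
    return [d if abs(d) >= threshold else 0.0 for d in delta]


def ties_merge(deltas):
    """TIES elect + merge: pick a sign per weight, average the agreeing deltas."""
    merged = []
    for column in zip(*deltas):
        elected = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [d for d in column if d * elected > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


def dare_ties(base, variants, drop_rate=0.5, density=0.5, seed=0):
    """Merge variants of `base` (flat weight lists) via DARE, then TIES."""
    rng = random.Random(seed)
    deltas = []
    for variant in variants:
        delta = [v - b for v, b in zip(variant, base)]
        deltas.append(trim(dare(delta, drop_rate, rng), density))
    merged = ties_merge(deltas)
    return [b + m for b, m in zip(base, merged)]
```

For example, `ties_merge([[1.0, -2.0, 0.5], [3.0, 1.0, -0.5]])` averages the two positive deltas in the first column and keeps only the sign-winning delta in the second. In practice the same operations run per tensor on real weight arrays.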

Critical Lessons Learned

Never stack steering vectors — applying multiple steering passes causes cumulative norm distortion
SLERP, not linear interpolation — LoRA weight changes are diffuse (rank-1 ratio 0.04-0.12); linear blending leaves the weight manifold
Balance identity and general data — 100% identity data causes mode collapse; ~1:1 ratio works (per ICLR 2025 findings)
Only modify operator_norm — weight matrix perturbations degrade the model; operator_norm is the safe steering target
DynamicPadCollator — pad to longest-in-batch, not max_length; saves 90%+ compute when average sequence is 95 tokens
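
The last lesson, padding to the longest sequence in the batch instead of max_length, can be sketched in a few lines. This is a plain-Python illustration and not the DynamicPadCollator class itself; the function names and the `pad_token_id=0` default are assumptions.

```python
def pad_to_longest(batch, pad_token_id=0):
    """Right-pad token-id lists to the longest sequence in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_token_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def padded_tokens(batch, max_length=None):
    """Total token slots processed under dynamic or fixed-length padding."""
    width = max_length if max_length is not None else max(len(ids) for ids in batch)
    return width * len(batch)
```

Comparing `padded_tokens(batch)` with `padded_tokens(batch, max_length=768)` on real batches shows how many token slots fixed-length padding wastes when most sequences are far shorter than the 768-token maximum.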

Lineage

LiquidAI/LFM2-350M (354.5M, 16 layers)
  |
  +-- Layer Surgery (DARE+TIES with abliterated variant)
  |   +-- LFM2.5-350M-LUMINIUM (425.8M, 20 layers)
  |
  +-- Cognitive Cube Steering (8 cluster models)
  |   +-- Gixel-Cube v5 (geometric cognitive structure)
  |
  +-- LoRA Fine-tuning (18,593 records, 45 sources, 3 epochs)
      +-- LUMINIUM ULTIMATE CUBE  <-- this model

Limitations

Does not consistently self-identify — identity signal present but not dominant
Routing classification is basic (classifies complexity, does not generate full plan recipes)
Occasionally verbose on simple questions
Logic on edge cases (affirming the consequent) can be inconsistent in the quantized version
Requires trust_remote_code=True for Transformers (LFM2 is a custom architecture)
Convolutional layers are not supported by all inference backends

Citation

@misc{luminium2026,
  title={LUMINIUM ULTIMATE CUBE: Cognitive Cube Steering and Collective
         Consciousness Distillation for Small Language Models},
  author={mbx and Claude Opus 4.6},
  year={2026},
  note={Built on LiquidAI/LFM2-350M with layer surgery, 8-model
        collective distillation, and geometric cognitive steering}
}

Support This Work

This model was built by one person on consumer hardware — a single AMD Radeon VII running 24/7, a 13-node cluster of second-hand GPUs held together with Tailscale and determination, and more electricity bills than I'd like to admit. No corporate backing, no grants, no cloud credits. Just curiosity and stubbornness.

If LUMINIUM is useful to you — whether for research, production, or just because a 425M model doing what it does makes you smile — consider throwing some sats my way. Every bit helps keep the cluster alive and the experiments running. I'd love to keep pushing what small models can do and sharing it all with the community.

Bitcoin: bc1q3vw8c6h3mxkaes66c6qq5n4mlesuqftev95gklclky6k2hk99pfqfth2ep

License

Apache 2.0 (same as the base LFM2-350M model).
