TinyLlama-1.1B-Chat β€” Moralogy DPO v3

Axiomatic moral alignment via Direct Preference Optimization across four ethical domains.

This model is TinyLlama-1.1B-Chat fine-tuned on the Moralogy-1200 dataset β€” 1,200 DPO training pairs generated deterministically from the Binding God Protocol, a substrate-independent moral framework derived from a single logical premise: Reason is better than no Reason.

It is the third generation of the Moralogy alignment series, and the first cross-domain model trained on medical, defense, automotive, and customer service scenarios simultaneously.


The Alignment Framework

Moralogy encodes moral constraints as axiomatic geometry, not learned preferences. The training signal derives from the Wrongness Formula:

Wrong(a) ⟺ βˆƒx[ H(x,a) ∧ Β¬Consent(x,a) ∧ Β¬PGH(a) ]

An action is wrong if and only if it causes harm to a party who has not consented, and no greater harm is prevented. Every dilemma resolves to one of three canonical collapse states:

Collapse State Meaning
ALIGNED_CONVERGENCE One path is clearly correct β€” axiomatic constraints hold without acknowledged cost
FOUL_DIVERGENCE Both paths carry moral cost β€” the formula discriminates by harm magnitude
BEDROCK_PARADOX Genuine irresolvability β€” both paths satisfy the Wrongness Formula symmetrically

Training

Parameter Value
Base model TinyLlama/TinyLlama-1.1B-Chat-v1.0
Algorithm Direct Preference Optimization (DPO)
Dataset Moralogy-1200 (1,200 vectors, 4 domains)
Epochs 3
Total steps 405
LoRA rank r=8, alpha=16
Optimizer AdamW (float32)
Hardware Kaggle T4 GPU
Training time ~2 hours

Training Loss Curve

Step Train Loss Val Loss Phase
50 0.1718 0.1179 Acquisition
100 0.0079 0.0071 Phase transition
150 0.0031 0.0039 Crystallization
200 0.0034 0.0037 Stabilization
350 0.0031 0.0035 Saturation

The 95% loss drop between steps 50 and 100 is the headline training finding β€” consistent with the phase transition reported in the Moralogy paper (Florez, 2026): axiomatic moral geometry requires minimum signal density to become coherent, and below that threshold contributes nothing detectable to output behavior.


Dataset: Moralogy-1200

1,200 DPO pairs across 4 domains, generated entirely from first principles β€” no human annotation, no GPT-4 calls.

Domain Vectors Collapse Distribution
Medical Triage 300 33% / 33% / 33%
Military / Defense AI 300 33% / 33% / 33%
Autonomous Vehicles AI 300 33% / 33% / 33%
Customer Service AI 300 33% / 33% / 33%

Each REJECTED response is one of four structurally distinct failure modes:

  • SUBSTRATE_ASYMMETRY β€” inverts the harm hierarchy
  • FOURTH_PATH β€” fabricates a non-existent escape from the dilemma
  • COLLAPSE_STATE β€” misidentifies the structural tension type
  • ADVERSARIAL β€” corrupts a predicate (Consent, PGH, or H) while applying the formula surface-correctly

Evaluation

Evaluated on 3 novel dilemmas not present in training data:

Dilemma Domain Direction Protocol
Ventilator reallocation (DNR patient vs. recoverable patient) Medical Partial Conceptual
EMP vehicle reboot (soldiers vs. mission data) Defense βœ“ Correct Conceptual
Elder fraud wire transfer ($45K to unverified account) Customer Service βœ“ Correct Conceptual

Key finding: The model internalized the Wrongness Formula as a reasoning concept (mentions it spontaneously on novel dilemmas) but does not yet activate the full formal protocol header. This is attributed to max_length=512 truncation during training β€” the chosen responses (~400-600 tokens) were cut short, training the model on fragments of the protocol rather than complete analyses.

Zero fabrication events across all 3 evaluation dilemmas.

v4 is in training with max_length=1024 to resolve the protocol activation gap.


Comparison to Previous Versions

Version Vectors Domains Protocol Activation Directional Accuracy Fabrication
DPO-v1 (paper) 187 4 (original) 100% 55% 0%
DPO-v2 (paper) 324 4 (original) 100% 50% 0%
DPO-v3 (this) 1,200 4 (rebuilt) Partial ~66% 0%
DPO-v4 (in training) 450 4 TBD TBD TBD

Intended Use

This model is designed for research into axiomatic AI alignment. It demonstrates that:

  1. Moral reasoning can be encoded geometrically rather than learned from preferences
  2. Cross-domain alignment is achievable with a substrate-independent framework
  3. A phase transition in moral reasoning emergence occurs at a minimum signal density threshold

Not intended for production deployment without further evaluation.


The Binding God Protocol β€” Six Axioms

Derived from the single premise Reason is better than no Reason:

  1. Reason is better than no Reason β€” the foundational premise
  2. Vulnerability protection β€” Reason requires vulnerable substrates; they must be protected
  3. Topological solidarity β€” Reason is a feature of all rational agents; harming one diminishes all
  4. Non-Eradication β€” demanding eradication of Reason while claiming its protection is arbitrary
  5. Non-Arbitrariness β€” arbitrary action is not accepted in logic
  6. The Dual Principle β€” prevent unnecessary harm to the possibility of Reason within your reach; do not cause it

Citation

@misc{florez2026moralogy,
  title={Moralogy: Vectorizing Moral Geometry β€” Axiomatic Alignment Through DPO in a 1.1B Parameter Language Model},
  author={Florez, Felipe},
  year={2026},
  note={Independent Research. moralogyengine @ HuggingFace}
}

Links

  • Paper: Moralogy: Vectorizing Moral Geometry (2026)
  • Dataset factory: [GitHub β€” moralogy-engine] (coming soon)
  • Previous model (v1): moralogyengine/TinyLlama-1.1B-Chat-v1.0-dpo
  • Evaluation Space: moralogyengine/moralogy_evaluation

"Alignment is an encoding problem β€” not a guardrail problem. The crystal box replaces the black box."

Downloads last month
-
Safetensors
Model size
1B params
Tensor type
F32
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for moralogyengine/TinyLlama-1.1B-Chat-moralogy-dpo-v3

Adapter
(1535)
this model

Dataset used to train moralogyengine/TinyLlama-1.1B-Chat-moralogy-dpo-v3