| --- |
| license: apache-2.0 |
| --- |
| # HROM-M1 |
|
|
| **HROM-M1** is a transformer-based Mixture-of-Experts (MoE) language model built entirely in PyTorch by me, *Timur Hromek*, a 15-year-old self-taught developer. It's designed for multi-turn, persona-aware dialogue with a focus on safety, modularity, and extensibility. |
|
|
| This implementation includes top-k expert routing, rotary position embeddings, SwiGLU activations, and a custom tokenizer, along with built-in safety filters and checkpoint management. |
|
|
| ## Features |
|
|
| - Mixture-of-Experts (MoE) with 8 experts and top-2 routing per token. |
| - Transformer architecture with 8 layers, 8 heads, and RoPE (rotary positional embeddings). |
| - SwiGLU activation for efficient MLP computation. |
| - Multi-dataset training support, including: |
| - `daily_dialog` |
| - `empathetic_dialogues` |
| - `blended_skill_talk` |
| - `persona-chat` |
| - `papahawk/conversational-01` |
| - Custom tokenizer using Byte-Pair Encoding (BPE). |
| - `SafetyManager` for blocking unsafe generations using token-level filtering. |
| - `CheckpointManager` with rotating save slots and auto-recovery. |
| - AMP (mixed precision) and gradient accumulation support. |
|
|
| ## Model Specs |
|
|
| | Hyperparameter | Value | |
| |--------------------------|----------------| |
| | Model Parameters | 370.46M | |
| | Embedding Size (dim) | 768 | |
| | Layers | 8 | |
| | Attention Heads | 8 | |
| | Expert FF Dim | 2048 | |
| | Number of Experts | 8 | |
| | Top-k Experts | 2 | |
| | Vocabulary Size | 32,000 | |
| | Max Sequence Length | 512 tokens | |
| | Dropout | 0.1 | |
| | Batch Size | 16 | |
| | Learning Rate | 2e-5 | |
| | Optimizer | AdamW | |
| | Epochs | 30 | |
| | Grad Accumulation Steps | 8 | |
|
|
| ## Architecture Overview |
|
|
| - `HROMBlock`: Transformer block with attention and MoE feedforward. |
| - `MoELayer`: Routes tokens to top-k experts and applies load balancing loss. |
| - `Expert`: Lightweight FFN with SwiGLU nonlinearity. |
| - `SafetyManager`: Filters generations using predefined token patterns. |
| - `TokenizerTrainer`: Builds a BPE tokenizer from dialogue data. |
| - `CheckpointManager`: Rotates and auto-recovers checkpoints. |
|
|
| ## Safety |
|
|
| The model includes a basic content filter that blocks sequences containing unsafe keywords by checking token IDs. Unsafe generations are interrupted before completion. |
|
|
| ## Installation |
|
|
| ```bash |
| git clone https://github.com/yourusername/HROM-M1.git |
| cd HROM-M1 |
| pip install -r requirements.txt |
| ``` |
|
|
| ## Training |
|
|
| ```bash |
| python HROM-M1.py |
| ``` |
|
|
| The tokenizer will auto-train if not found. Dialogue datasets are pulled via HuggingFace `datasets`.Dialogue datasets are pulled via HuggingFace `datasets`. |