	---
	license: apache-2.0
	---
	# HROM-M1

	HROM-M1 is a transformer-based Mixture-of-Experts (MoE) language model built entirely in PyTorch by me, Timur Hromek, a 15-year-old self-taught developer. It's designed for multi-turn, persona-aware dialogue with a focus on safety, modularity, and extensibility.

	This implementation includes top-k expert routing, rotary position embeddings, SwiGLU activations, and a custom tokenizer, along with built-in safety filters and checkpoint management.

	## Features

	- Mixture-of-Experts (MoE) with 8 experts and top-2 routing per token.
	- Transformer architecture with 8 layers, 8 heads, and RoPE (rotary positional embeddings).
	- SwiGLU activation for efficient MLP computation.
	- Multi-dataset training support, including:
	  - `daily_dialog`
	  - `empathetic_dialogues`
	  - `blended_skill_talk`
	  - `persona-chat`
	  - `papahawk/conversational-01`
	- Custom tokenizer using Byte-Pair Encoding (BPE).
	- `SafetyManager` for blocking unsafe generations using token-level filtering.
	- `CheckpointManager` with rotating save slots and auto-recovery.
	- AMP (mixed precision) and gradient accumulation support.
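
To make the rotary position embeddings (RoPE) in the list above more concrete, here is a minimal sketch in plain Python rather than PyTorch. It is not the code from this repository: the adjacent-pair rotation layout and the base of 10000 are assumptions, and `apply_rope` and `dot` are hypothetical helper names.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, dim, 2):
        # Each pair has its own frequency; earlier pairs rotate faster.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.2, 0.7, -0.3, 0.1]

# Both pairs of positions are 3 apart, so the attention scores should match.
score_a = dot(apply_rope(q, 5), apply_rope(k, 2))
score_b = dot(apply_rope(q, 13), apply_rope(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree up to floating-point error because shifting both positions by the same amount leaves the relative rotation between query and key unchanged, which is the property that makes RoPE a relative position encoding.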

	## Model Specs

	| Hyperparameter          | Value      |
	|-------------------------|------------|
	| Model Parameters        | 370.46M    |
	| Embedding Size (dim)    | 768        |
	| Layers                  | 8          |
	| Attention Heads         | 8          |
	| Expert FF Dim           | 2048       |
	| Number of Experts       | 8          |
	| Top-k Experts           | 2          |
	| Vocabulary Size         | 32,000     |
	| Max Sequence Length     | 512 tokens |
	| Dropout                 | 0.1        |
	| Batch Size              | 16         |
	| Learning Rate           | 2e-5       |
	| Optimizer               | AdamW      |
	| Epochs                  | 30         |
	| Grad Accumulation Steps | 8          |
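
A few derived quantities follow directly from the table. The sketch below computes them in plain Python. The head dimension and effective batch size are simple arithmetic. The parameter estimate rests on assumptions this README does not confirm: three weight matrices per SwiGLU expert, four square attention projections per layer, one linear router per layer, an untied output head, and no biases or normalization weights.

```python
# Values copied from the Model Specs table.
dim, layers, heads = 768, 8, 8
expert_ff_dim, num_experts, top_k = 2048, 8, 2
vocab_size, max_seq_len = 32_000, 512
batch_size, grad_accum_steps = 16, 8

head_dim = dim // heads                              # 96
effective_batch = batch_size * grad_accum_steps      # 128 sequences per optimizer step
max_tokens_per_step = effective_batch * max_seq_len  # 65,536 tokens at most

# Rough parameter estimate under the assumptions stated in the lead-in.
attention = 4 * dim * dim            # Q, K, V, and output projections
expert = 3 * dim * expert_ff_dim     # gate, up, and down matrices of one expert
router = dim * num_experts           # one linear gate per MoE layer
embeddings = 2 * vocab_size * dim    # input embedding plus untied output head

total = layers * (attention + num_experts * expert + router) + embeddings
active = layers * (attention + top_k * expert + router) + embeddings

print(f"head_dim={head_dim}, effective_batch={effective_batch}")
print(f"estimated total parameters: {total / 1e6:.2f}M")
print(f"estimated parameters used per token: {active / 1e6:.2f}M")
```

Under these assumptions the estimate comes to about 370.07M, close to the 370.46M reported in the table, while only about 143.57M of those parameters take part in any single token's forward pass because just 2 of the 8 experts run per layer.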

	## Architecture Overview

	- `HROMBlock`: Transformer block with attention and MoE feedforward.
	- `MoELayer`: Routes tokens to top-k experts and applies load balancing loss.
	- `Expert`: Lightweight FFN with SwiGLU nonlinearity.
	- `SafetyManager`: Filters generations using predefined token patterns.
	- `TokenizerTrainer`: Builds a BPE tokenizer from dialogue data.
	- `CheckpointManager`: Rotates and auto-recovers checkpoints.
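
The interaction between `MoELayer` and `Expert` can be sketched in dependency-free Python. This is an illustration, not the repository's PyTorch code. The renormalization of the top-k gate weights and the Switch-Transformer-style auxiliary loss are assumptions, as are the tiny dimensions and the helper names `moe_forward`, `load_balance_loss`, and `random_matrix`.

```python
import math
import random


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """SwiGLU feed-forward network: down(silu(gate(x)) * up(x))."""

    def __init__(self, dim, hidden, rng):
        self.gate = random_matrix(hidden, dim, rng, 0.1)
        self.up = random_matrix(hidden, dim, rng, 0.1)
        self.down = random_matrix(dim, hidden, rng, 0.1)

    def __call__(self, x):
        gated = [silu(g) * u for g, u in zip(matvec(self.gate, x), matvec(self.up, x))]
        return matvec(self.down, gated)


def moe_forward(x, router, experts, top_k=2):
    """Send one token to its top-k experts and mix their outputs."""
    probs = softmax(matvec(router, x))
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    out = [0.0] * len(x)
    for i in chosen:
        weight = probs[i] / norm
        for j, value in enumerate(experts[i](x)):
            out[j] += weight * value
    return out, chosen, probs


def load_balance_loss(all_probs, all_chosen, num_experts):
    """Assumed auxiliary loss: num_experts * sum(routed fraction * mean gate prob)."""
    assignments = sum(len(chosen) for chosen in all_chosen)
    frac = [sum(c.count(i) for c in all_chosen) / assignments for i in range(num_experts)]
    mean_prob = [sum(p[i] for p in all_probs) / len(all_probs) for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


rng = random.Random(0)
dim, hidden, num_experts = 8, 16, 8  # toy sizes, far smaller than the real model
experts = [Expert(dim, hidden, rng) for _ in range(num_experts)]
router = random_matrix(num_experts, dim, rng, 1.0)
tokens = random_matrix(4, dim, rng, 1.0)

results = [moe_forward(token, router, experts) for token in tokens]
aux = load_balance_loss([r[2] for r in results], [r[1] for r in results], num_experts)
print([r[1] for r in results], round(aux, 3))
```

With perfectly uniform routing this auxiliary loss equals 1.0, and it grows when a few experts receive most of the tokens, which is why adding it to the training loss discourages expert collapse.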

	## Safety

	The model includes a basic content filter that blocks sequences containing unsafe keywords by checking token IDs. Unsafe generations are interrupted before completion.
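
A token-level filter of this kind can be sketched as follows. This is a simplified illustration, not the actual `SafetyManager`: the function names, the blocked ID sequence, and the scripted token stream are all hypothetical, and a real filter would obtain the blocked sequences by tokenizing its unsafe keyword list.

```python
def contains_blocked(token_ids, blocked_sequences):
    """Return True if any blocked token-ID sequence occurs contiguously."""
    token_ids = list(token_ids)
    for seq in blocked_sequences:
        seq = list(seq)
        n = len(seq)
        if n == 0:
            continue
        for start in range(len(token_ids) - n + 1):
            if token_ids[start:start + n] == seq:
                return True
    return False


def generate_safely(next_token, blocked_sequences, eos_id, max_new_tokens=32):
    """Generate token IDs, stopping before a blocked sequence would be completed."""
    generated = []
    for _ in range(max_new_tokens):
        token = next_token(generated)
        if token == eos_id:
            break
        if contains_blocked(generated + [token], blocked_sequences):
            break  # interrupt: the offending token is never appended
        generated.append(token)
    return generated


# Scripted stand-in for a model: it would emit 17 followed by 42, which is blocked.
stream = iter([5, 9, 17, 42, 7, 3])
safe_output = generate_safely(
    next_token=lambda prefix: next(stream),
    blocked_sequences=[[17, 42]],
    eos_id=0,
)
print(safe_output)  # [5, 9, 17]
```

Matching on token IDs is fast but only catches the exact tokenizations in the block list, so a keyword that the BPE tokenizer splits differently in another context can slip past a filter like this one.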

	## Installation

	```bash
	git clone https://github.com/yourusername/HROM-M1.git
	cd HROM-M1
	pip install -r requirements.txt
	```

	## Training

	```bash
	python HROM-M1.py
	```

	The tokenizer is trained automatically if no saved tokenizer is found. Dialogue datasets are pulled via the Hugging Face `datasets` library.
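
Long runs rely on the rotating save slots and auto-recovery that `CheckpointManager` provides. The sketch below shows one way such a scheme can work, using only the standard library. The class name `RotatingCheckpoints`, the file naming, the slot count of 3, and the JSON payload are assumptions made for illustration; a PyTorch implementation would presumably store model and optimizer state with `torch.save` instead.

```python
import json
import tempfile
from pathlib import Path


class RotatingCheckpoints:
    """Hypothetical stand-in for CheckpointManager: keeps the newest `max_slots` files."""

    def __init__(self, directory, max_slots=3):
        if max_slots < 1:
            raise ValueError("max_slots must be at least 1")
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.max_slots = max_slots

    def existing(self):
        # Zero-padded step numbers make lexical order match numeric order.
        return sorted(self.directory.glob("step_*.json"))

    def save(self, step, state):
        final = self.directory / f"step_{step:08d}.json"
        tmp = final.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(final)  # rename last so a crash never leaves a partial slot
        for old in self.existing()[:-self.max_slots]:
            old.unlink()  # rotate: drop everything but the newest slots
        return final

    def load_latest(self):
        for path in reversed(self.existing()):
            try:
                return json.loads(path.read_text())
            except (json.JSONDecodeError, OSError):
                continue  # corrupt or unreadable: fall back to an older slot
        return None


with tempfile.TemporaryDirectory() as workdir:
    manager = RotatingCheckpoints(workdir, max_slots=3)
    for step in range(1, 6):
        manager.save(step, {"step": step})
    kept = [path.name for path in manager.existing()]

    # Simulate a crash that corrupted the newest checkpoint.
    manager.existing()[-1].write_text("{not json")
    recovered = manager.load_latest()

print(kept)
print(recovered)
```

After five saves only the three newest slots remain, and loading skips the corrupted newest file and resumes from step 4.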