---
license: mit
language:
- en
tags:
- language-model
- cpu-trained
- cortex
- flashlm
- small-language-model
---

# FlashLM v8.3: CORTEX-VIII

**CPU-trained language model. 6.57M parameters. Trained from scratch in 2 hours on a free-tier cloud CPU.**

---

## Architecture

Each CORTEX-VIII layer combines two complementary attention mechanisms (sliding window attention for local context, gated delta memory for global context) with the supporting components below:

| Component | Role | Config |
|-----------|------|--------|
| **Sliding Window Attention** | Local context (W=32 tokens) | 4 heads, d_head=64 |
| **Gated Delta Memory** | Global context via delta rule | d_mem=32, learnable decay |
| **Lookahead Value Heads** | Predict future loss for search-guided decoding | 1 per layer |
| **SwiGLU FFN** | Nonlinear mixing | d_ff=512 |
| **RMSNorm** | Layer normalization | Pre-norm |
| **Weight Tying** | Share embed/output weights | n/a |

Additional training and generation features:

- **Entropy regularization** (weight=0.01): prevents peaked distributions that cause repetition
- **Nucleus sampling** (top_p=0.85) + **frequency penalty** (1.2) at generation time
- **Zero weight decay** on embedding/output layers to preserve low-frequency token representations

## Training Details

| Metric | Value |
|--------|-------|
| **Dataset** | TinyStories V2-GPT4 |
| **Training subset** | First 10M tokens (~1.3 epochs) |
| **Hardware** | 2 vCPU / 5GB RAM (free-tier cloud) |
| **Training time** | 2 hours |
| **Validation PPL** | 2.50 (best) |
| **Throughput** | 1,861 tokens/sec |
| **Steps** | 1,636 |
| **Total tokens seen** | 13.4M |
| **Batch size** | 4 x 8 gradient accumulation |
| **Peak LR** | 5e-4 (cosine decay to 1e-5) |
| **Warmup** | 100 steps |

## Model Lineup

| Version | Architecture | Params | PPL | Highlight |
|---------|-------------|-------:|----:|-----------|
| v7.4 CORTEX-VIII | Gated DeltaNet + SWA | 6.6M | **2.33** | Best PPL |
| v8.1 SearchLM | CORTEX + lookahead value heads | 6.6M | 2.40 | V_Corr +0.66 |
| v8.2 CORTEX-VIII | + 20M subset + entropy reg | 6.6M | 2.42 | Broke repetition loops |
| **v8.3 CORTEX-VIII** | **+ 10M subset, D_FF=512** | **6.6M** | **2.50** | **Best generation diversity** |
| v8.4 CORTEX-IX | + full context SWA + 2x memory | ~6.8M | TBD | In progress |

## Files

| File | Description |
|------|-------------|
| `best.pt` | Best checkpoint (lowest validation loss) |
| `final.pt` | Final checkpoint with full config and training results |
| `tokenizer.json` | Byte-level BPE tokenizer (vocab=4,096) |
| `results.json` | Training metrics summary |

## Usage

```python
import torch
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Load model checkpoint
ckpt = torch.load("best.pt", map_location="cpu")
print(f"Val PPL: {ckpt['val_ppl']:.2f}")

# For full model architecture, see:
# https://github.com/changcheng967/FlashLM/blob/main/v8/train_v83.py
```

## Generation Example

```
Prompt: "Once upon a time"
Output: "Once upon a time . sun like . helped look this ! began bed to . thought cake a and fish him Tom Mr Bunny fish . looked Ben place ! thinks book ..."
```

Generation uses nucleus sampling (temperature=1.2, top_p=0.85) with frequency penalty (1.2) to maximize diversity.

## Limitations

- **Grammar is broken**: the model learned vocabulary and word statistics (PPL 2.50) but not sentence structure. Greedy decoding produces repetition loops; sampling produces diverse but ungrammatical text.
- **SWA window too small**: W=32 (~8 words) can't capture the cross-sentence dependencies needed for grammar.
- **Undertrained**: 13.4M tokens seen vs 574M in the full dataset. The model needs more data coverage.
- v8.4 (CORTEX-IX), still in progress, targets these issues with full-context attention (W=256) and doubled memory capacity.
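The generation settings quoted in this card (temperature=1.2, top_p=0.85, frequency penalty 1.2) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from `train_v83.py`: the helper name `sample_next_token` is hypothetical, and the way the frequency penalty is applied (subtracted from a token's logit once per earlier occurrence) is an assumption, since the card does not say how the penalty is defined.

```python
import math
import random
from collections import Counter


def sample_next_token(logits, history, temperature=1.2, top_p=0.85,
                      freq_penalty=1.2, rng=random):
    """Pick one token id from raw logits (hypothetical helper)."""
    # Assumption: subtract the penalty once per earlier occurrence of a token.
    counts = Counter(history)
    adjusted = [(logit - freq_penalty * counts[i]) / temperature
                for i, logit in enumerate(logits)]

    # Numerically stable softmax over the adjusted logits.
    peak = max(adjusted)
    weights = [math.exp(a - peak) for a in adjusted]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus: smallest set of highest-probability tokens with mass >= top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample inside the nucleus; choices() normalizes the weights itself.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


# Toy demo with a 4-token vocabulary and made-up logits.
rng = random.Random(0)
toy_logits = [4.0, 3.5, 0.5, -2.0]
history = []
for _ in range(8):
    history.append(sample_next_token(toy_logits, history, rng=rng))
print(history)
```

In a real decoding loop the logits would come from the model at each step rather than staying fixed. The penalty makes frequently emitted tokens progressively less likely, which is one way to break the repetition loops that greedy decoding falls into.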
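The training figures reported in this card can be cross-checked with a little arithmetic. The sketch below assumes a sequence length of 256 tokens, which the card does not state; it is chosen here only because it makes the step count and the total token count agree. All other numbers are taken from the Training Details table and the Limitations list.

```python
# Figures reported in this card.
steps = 1_636
micro_batch = 4
grad_accum = 8
throughput = 1_861            # tokens per second
subset_tokens = 10e6          # training subset
full_dataset_tokens = 574e6   # full dataset

# Assumption: 256-token training sequences (not stated in the card).
seq_len = 256

tokens_per_step = micro_batch * grad_accum * seq_len   # 8,192
est_tokens = steps * tokens_per_step                   # 13,402,112
est_hours = est_tokens / throughput / 3600
epochs = est_tokens / subset_tokens
unique_coverage = subset_tokens / full_dataset_tokens

print(f"tokens per step: {tokens_per_step:,}")
print(f"tokens seen:     {est_tokens / 1e6:.1f}M")
print(f"training time:   {est_hours:.2f} h")
print(f"epochs:          {epochs:.2f}")
print(f"unique coverage: {unique_coverage:.1%} of the full dataset")
```

Under that assumption the estimate lands on about 13.4M tokens in about 2 hours, matching the table, and shows that the model saw under 2% of the full dataset's unique tokens.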
## Citation

```bibtex
@misc{flashlm,
  author = {Cheng Chang},
  title = {FlashLM: CPU-Native Ternary Language Models},
  year = {2026},
  url = {https://github.com/changcheng967/FlashLM}
}
```

## Links

- **GitHub:** [changcheng967/FlashLM](https://github.com/changcheng967/FlashLM)
- **Code:** [train_v83.py](https://github.com/changcheng967/FlashLM/blob/main/v8/train_v83.py)

---

Trained by **Cheng Chang**. Architecture design assistance by Claude Code (Anthropic).