---
license: mit
language:
- en
tags:
- language-model
- cpu-trained
- cortex
- flashlm
- small-language-model
---

# FlashLM v8.3 — CORTEX-VIII

**CPU-trained language model. 6.57M parameters. Trained from scratch in 2 hours on a free-tier cloud CPU.**

---

## Architecture

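Each CORTEX-VIII layer pairs two complementary context mechanisms, sliding window attention for local context and a gated delta memory for global context, with the supporting components listed below: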

| Component | Role | Config |
|-----------|------|--------|
| **Sliding Window Attention** | Local context (W=32 tokens) | 4 heads, d_head=64 |
| **Gated Delta Memory** | Global context via delta rule | d_mem=32, learnable decay |
| **Lookahead Value Heads** | Predict future loss for search-guided decoding | 1 per layer |
| **SwiGLU FFN** | Nonlinear mixing | d_ff=512 |
| **RMSNorm** | Layer normalization | Pre-norm |
| **Weight Tying** | Share embed/output weights | — |
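
To make the first two rows concrete, here is a minimal pure-Python sketch of a causal sliding-window mask and a single gated delta-rule memory update. It follows the generic textbook form of both techniques and is not taken from `train_v83.py`: the function names, the choice to count the current token inside the window, and the scalar `beta` and `decay` arguments are all assumptions made for illustration.

```python
def sliding_window_mask(seq_len, window):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: the window is causal and includes the current token, so
    position i sees positions i - window + 1 .. i.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def delta_memory_step(memory, key, value, beta, decay):
    """Apply one gated delta-rule update to an associative memory.

    memory: d_v x d_k matrix stored as a list of rows.
    key:    vector of length d_k.
    value:  vector of length d_v.
    beta:   write strength in [0, 1] (hypothetical scalar gate).
    decay:  forget factor in [0, 1] (stands in for the learnable decay).
    """
    # 1. Forget: shrink the old memory.
    decayed = [[decay * m for m in row] for row in memory]
    # 2. Read: what does the decayed memory currently return for this key?
    predicted = [sum(m * k for m, k in zip(row, key)) for row in decayed]
    # 3. Correct: write only the prediction error, scaled by beta.
    error = [v - p for v, p in zip(value, predicted)]
    return [
        [m + beta * e * k for m, k in zip(row, key)]
        for row, e in zip(decayed, error)
    ]


def read_memory(memory, query):
    """Retrieve the value associated with `query` (a matrix-vector product)."""
    return [sum(m * q for m, q in zip(row, query)) for row in memory]


# Usage: a 2x2 memory that stores, then overwrites, the value for one key.
memory = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
memory = delta_memory_step(memory, key, [3.0, -1.0], beta=1.0, decay=0.9)
first_read = read_memory(memory, key)
memory = delta_memory_step(memory, key, [5.0, 5.0], beta=1.0, decay=0.5)
second_read = read_memory(memory, key)
mask = sliding_window_mask(seq_len=5, window=2)
```

With a unit-norm key and `beta=1.0`, the update replaces the stored value outright, which is the property that lets a fixed-size memory carry information beyond the attention window.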

Additional training and generation settings:

- **Entropy regularization** (weight=0.01) — prevents peaked distributions that cause repetition
- **Nucleus sampling** (temperature=1.2, top_p=0.85) + **frequency penalty** (1.2) at generation time
- **Zero weight decay** on embedding/output layers to preserve low-frequency token representations
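
The entropy term can be sketched as follows. The card only states the weight (0.01), so the exact form is an assumption: this version subtracts the weighted entropy of the predicted distribution from the cross-entropy, which rewards less peaked predictions. The helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(logits, target, entropy_weight=0.01):
    """Cross-entropy minus a small entropy bonus (assumed form)."""
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    return cross_entropy - entropy_weight * entropy(probs)


# Usage: a sharply peaked prediction has lower entropy than a flatter one,
# so it earns a smaller bonus.
peaked = entropy(softmax([10.0, 0.0, 0.0, 0.0]))
flat = entropy(softmax([1.0, 0.0, 0.0, 0.0]))
uniform_loss = regularized_loss([0.0, 0.0, 0.0, 0.0], target=2)
```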

## Training Details

| Metric | Value |
|--------|-------|
| **Dataset** | TinyStories V2-GPT4 |
| **Training subset** | First 10M tokens (~1.3 epochs) |
| **Hardware** | 2 vCPU / 5GB RAM (free-tier cloud) |
| **Training time** | 2 hours |
| **Validation PPL** | 2.50 (best checkpoint) |
| **Throughput** | 1,861 tokens/sec |
| **Steps** | 1,636 |
| **Total tokens seen** | 13.4M |
| **Batch size** | 4 × 8 gradient accumulation steps (effective batch of 32 sequences) |
| **Peak LR** | 5e-4 (cosine decay to 1e-5) |
| **Warmup** | 100 steps |
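
The learning-rate rows above can be read as the schedule below. This is a sketch under two assumptions the card does not confirm: warmup is linear, and the cosine decay spans exactly the remaining 1,536 steps. `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 5e-4
MIN_LR = 1e-5
WARMUP_STEPS = 100
TOTAL_STEPS = 1_636


def lr_at(step):
    """Learning rate at a 0-indexed optimizer step."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup that reaches the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 99, 100, 868, 1_636):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```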

## Model Lineup

| Version | Architecture | Params | PPL | Highlight |
|---------|-------------|-------:|----:|-----------|
| v7.4 CORTEX-VIII | Gated DeltaNet + SWA | 6.6M | **2.33** | Best PPL |
| v8.1 SearchLM | CORTEX + lookahead value heads | 6.6M | 2.40 | V_Corr +0.66 |
| v8.2 CORTEX-VIII | + 20M subset + entropy reg | 6.6M | 2.42 | Broke repetition loops |
| **v8.3 CORTEX-VIII** | **+ 10M subset, d_ff=512** | **6.6M** | **2.50** | **Best generation diversity** |
| v8.4 CORTEX-IX | + full context SWA + 2x memory | ~6.8M | TBD | In progress |
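
For comparing rows, it can help to convert perplexity back to loss. The snippet assumes the usual definition, PPL = exp(mean cross-entropy in nats per token); the card does not state the log base or the averaging unit.

```python
import math

# Validation perplexities copied from the table above.
perplexity = {"v7.4": 2.33, "v8.1": 2.40, "v8.2": 2.42, "v8.3": 2.50}

# Assumed definition: PPL = exp(loss), so loss = ln(PPL).
loss_nats = {version: math.log(ppl) for version, ppl in perplexity.items()}

for version, loss in loss_nats.items():
    print(f"{version}: PPL {perplexity[version]:.2f} -> {loss:.3f} nats/token")

gap = loss_nats["v8.3"] - loss_nats["v7.4"]
print(f"v8.3 vs v7.4 gap: {gap:.3f} nats/token")
```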

## Files

| File | Description |
|------|-------------|
| `best.pt` | Best checkpoint (lowest validation loss) |
| `final.pt` | Final checkpoint with full config and training results |
| `tokenizer.json` | Byte-level BPE tokenizer (vocab=4,096) |
| `results.json` | Training metrics summary |

## Usage

```python
import torch
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

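# Load model checkpoint.
# Note: PyTorch >= 2.6 defaults to weights_only=True. If loading fails because
# the checkpoint holds non-tensor objects, pass weights_only=False, and only
# for files you trust.
ckpt = torch.load("best.pt", map_location="cpu")
print(f"Val PPL: {ckpt['val_ppl']:.2f}")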

# For full model architecture, see:
# https://github.com/changcheng967/FlashLM/blob/main/v8/train_v83.py
```

## Generation Example

```
Prompt: "Once upon a time"
Output: "Once upon a time . sun like . helped look this ! began bed to .
         thought cake a and fish him Tom Mr Bunny fish . looked Ben place !
         thinks book ..."
```

Generation uses nucleus sampling (temperature=1.2, top_p=0.85) with frequency penalty (1.2) to maximize diversity.
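
The decoding settings above can be sketched in pure Python as follows. The card does not say how the frequency penalty is applied, so this version assumes the subtractive convention (each logit is reduced by penalty × the number of times that token has already appeared); the implementation in `train_v83.py` may differ. `sample_next_token` is a hypothetical name.

```python
import math
import random
from collections import Counter


def sample_next_token(logits, history, temperature=1.2, top_p=0.85,
                      frequency_penalty=1.2, rng=random):
    """Pick one token id using a frequency penalty, temperature, and top-p."""
    # Penalize tokens in proportion to how often they already appeared
    # (assumed subtractive form), then apply the temperature.
    counts = Counter(history)
    adjusted = [
        (logit - frequency_penalty * counts[token_id]) / temperature
        for token_id, logit in enumerate(logits)
    ]

    # Numerically stable softmax.
    peak = max(adjusted)
    weights = [math.exp(x - peak) for x in adjusted]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of top tokens whose cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    nucleus, mass = [], 0.0
    for token_id in ranked:
        nucleus.append(token_id)
        mass += probs[token_id]
        if mass >= top_p:
            break

    # Sample within the nucleus; choices() renormalizes the weights.
    return rng.choices(nucleus, weights=[probs[t] for t in nucleus], k=1)[0]


# Usage: token 0 dominates until it has been emitted several times.
rng = random.Random(0)
logits = [5.0, 0.0, 0.0, 0.0]
fresh = [sample_next_token(logits, [], rng=rng) for _ in range(50)]
penalized = [sample_next_token(logits, [0] * 5, rng=rng) for _ in range(50)]
```

In this toy case the penalty pushes the repeated token out of the nucleus entirely, which is the mechanism that breaks repetition loops at the cost of coherence.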

## Limitations

- **Grammar is broken**: the model learned vocabulary and word statistics (PPL 2.50) but not sentence structure. Greedy decoding produces repetition loops; sampling produces diverse but ungrammatical text.
- **SWA window too small**: W=32 (~8 words) cannot capture the cross-sentence dependencies needed for grammar.
- **Undertrained**: 13.4M tokens seen (about 1.3 passes over a 10M-token subset) versus 574M in the full dataset, so training covered under 2% of the available data.
- **Planned fix**: v8.4 (CORTEX-IX), still in progress, targets these issues with full-context attention (W=256) and doubled memory capacity.
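
The training-budget figures can be cross-checked with a few lines of arithmetic. The sequence length of 256 is an assumption: it is inferred from W=256 being described as full-context and is not stated in the training table.

```python
BATCH_SIZE = 4
GRAD_ACCUM = 8
SEQ_LEN = 256                      # assumption, inferred from "full-context (W=256)"
STEPS = 1_636
TRAIN_SECONDS = 2 * 3600
SUBSET_TOKENS = 10_000_000
FULL_DATASET_TOKENS = 574_000_000

tokens_per_step = BATCH_SIZE * GRAD_ACCUM * SEQ_LEN
tokens_seen = tokens_per_step * STEPS
throughput = tokens_seen / TRAIN_SECONDS
epochs = tokens_seen / SUBSET_TOKENS
coverage = SUBSET_TOKENS / FULL_DATASET_TOKENS

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"total tokens seen:         {tokens_seen:,}")
print(f"throughput:                {throughput:,.0f} tokens/sec")
print(f"passes over the subset:    {epochs:.2f}")
print(f"share of full dataset:     {coverage:.1%}")
```

Under that assumption the numbers line up with the reported 13.4M tokens, 1,861 tokens/sec, and ~1.3 epochs.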

## Citation

```bibtex
@misc{flashlm,
  author = {Cheng Chang},
  title = {FlashLM: CPU-Native Ternary Language Models},
  year = {2026},
  url = {https://github.com/changcheng967/FlashLM}
}
```

## Links

- **GitHub:** [changcheng967/FlashLM](https://github.com/changcheng967/FlashLM)
- **Code:** [train_v83.py](https://github.com/changcheng967/FlashLM/blob/main/v8/train_v83.py)

---

Trained by **Cheng Chang**. Architecture design assistance by Claude Code (Anthropic).