---
language:
  - en
license: mit
library_name: lux
tags:
  - julia
  - lux
  - slm
  - philosophy
  - symbiogenesis
  - monarch-mixer
  - long-convolution
  - causal-conv
  - rmsnorm
  - swiglu
  - bpe
  - text-generation
pipeline_tag: text-generation
model-index:
  - name: SymbioSLM
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: LisaMegaWatts/philosophy-corpus
          name: philosophy-corpus
        metrics:
          - type: perplexity
            value: 79.9
            name: Val PPL (step 1000)
---

# SymbioSLM

A ~4.1M-parameter decoder-only language model built on the **Symbiogenesis** architecture, a novel multi-organelle sequence-mixing design inspired by biological endosymbiosis (Margulis, 1967). It is implemented entirely in Julia with Lux.jl and trained on classical philosophy texts.

## Architecture

Symbiogenesis replaces softmax attention with three complementary "organelles" per block, fused via a learned per-channel gate:

```
SymbioBlock (x6)
+-- RMSNorm
+-- SymbioSequenceMixer
|   +-- Organelle 1: CausalDepthwiseConv1d   (local n-gram patterns, K=4)
|   +-- Organelle 2: Multi-head MonarchMatrix (global sub-quadratic mixing)
|   +-- Organelle 3: LongConv                (global dense causal filter)
|   +-- OrganelleGate                        (per-channel softmax fusion)
+-- RMSNorm
+-- SwiGLU FFN
```

### How It Works

1. **CausalConv** captures local bigram/trigram/4-gram patterns via depthwise convolution (1 kernel per channel, length 4).

2. **Monarch matrices** provide global sequence mixing through the factorization `M = P^T * BlockDiag(L1) * P * BlockDiag(L2)`, an 87.5% parameter reduction versus a dense T x T mixing matrix (8,192 vs 65,536 parameters per head at T=256).

3. **LongConv** learns a full-length (T=256) causal filter per channel, enabling arbitrary position-dependent mixing.

4. **OrganelleGate** fuses all three via per-channel softmax: each of the 256 embedding channels independently learns which organelle to rely on.
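
The two convolutional organelles and the gate can be sketched in plain Julia. This is a minimal illustration, not the model's Lux.jl code: the function names, the `(T, D)` array layout, and the direct-loop convolution are assumptions (a real LongConv would more likely use an FFT), and the Monarch branch is replaced by a random placeholder.

```julia
# x is (T, D): T time steps, D channels. w is (K, D): one kernel per channel.
# Output channel d at time t only sees x[t-K+1:t, d]; earlier positions count as zero.
function causal_depthwise_conv(x::AbstractMatrix, w::AbstractMatrix)
    T, D = size(x)
    K = size(w, 1)
    @assert size(w, 2) == D "need one kernel per channel"
    y = zeros(promote_type(eltype(x), eltype(w)), T, D)
    for d in 1:D, t in 1:T, k in 1:min(K, t)
        y[t, d] += w[k, d] * x[t - k + 1, d]   # w[1, d] weights the current token
    end
    return y
end

# Per-channel softmax over the organelle axis; logits is (D, n_organelles).
function organelle_gate(outs, logits::AbstractMatrix)
    e = exp.(logits .- maximum(logits; dims=2))
    g = e ./ sum(e; dims=2)                    # each row sums to 1
    return sum(outs[i] .* transpose(g[:, i]) for i in eachindex(outs))
end

T, D = 16, 8                                   # small stand-ins for T=256, D=256
x = randn(Float32, T, D)
w_local = randn(Float32, 4, D)                 # Organelle 1: kernel length K=4
w_long  = randn(Float32, T, D)                 # Organelle 3: kernel length K=T
local_out = causal_depthwise_conv(x, w_local)
long_out  = causal_depthwise_conv(x, w_long)
monarch_out = randn(Float32, T, D)             # placeholder for Organelle 2
gate_logits = zeros(Float32, D, 3)             # zero logits = uniform gate
y = organelle_gate((local_out, monarch_out, long_out), gate_logits)
```

With zero logits every channel weights the three organelles equally, so `y` is their plain average; training would move each channel's row of `gate_logits` toward whichever organelle helps most.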

No explicit positional encoding (such as RoPE) is needed: the Monarch matrices and LongConv kernels are indexed by sequence position, so they learn position-dependent patterns directly.
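
The Monarch factorization from item 2 can be written out for a single head as follows. This is a sketch under assumptions: the block size is taken to be sqrt(T) = 16 (which reproduces the 8,192-parameter figure), `blockdiag_mul` and `monarch_apply` are illustrative names, and any causal masking the model applies is not shown because the card does not describe it.

```julia
# Multiply column j of X by the j-th block of L, i.e. BlockDiag(L) * vec(X).
blockdiag_mul(L, X) = reduce(hcat, [L[:, :, j] * X[:, j] for j in axes(X, 2)])

# Apply M = P^T * BlockDiag(L1) * P * BlockDiag(L2) to a length-b^2 vector.
function monarch_apply(L1, L2, x::AbstractVector)
    b = size(L1, 1)
    @assert length(x) == b^2
    X = reshape(x, b, b)          # column j holds the j-th contiguous block of x
    X = blockdiag_mul(L2, X)      # BlockDiag(L2)
    X = permutedims(X)            # P: transpose of the b x b grid
    X = blockdiag_mul(L1, X)      # BlockDiag(L1)
    X = permutedims(X)            # P^T undoes the permutation
    return vec(X)
end

T = 256
b = isqrt(T)                      # assumed: 16 blocks of size 16 x 16 per factor
L1 = randn(b, b, b) ./ sqrt(b)
L2 = randn(b, b, b) ./ sqrt(b)

n_monarch = length(L1) + length(L2)   # 2 * 16^3 = 8_192
n_dense = T^2                         # 65_536
reduction = 1 - n_monarch / n_dense   # 0.875

y = monarch_apply(L1, L2, randn(T))   # mixes one channel along the sequence axis
```

Each factor only mixes within blocks of 16 positions, but the permutation between them lets every output position depend on every input position.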

## Model Details

| Parameter | Value |
|---|---|
| Architecture | Symbiogenesis (3 organelles + gate) |
| Parameters | ~4.1M |
| Embed dim | 256 |
| Layers | 6 |
| Monarch heads | 4 |
| Context length | 256 tokens |
| Vocabulary | 2,000 (ByteLevel BPE) |
| FFN | SwiGLU (hidden=640) |
| Normalization | RMSNorm (pre-norm) |
| Weight tying | Yes (shared input/output embeddings) |
| Precision | Float32 (Float16 was slower at these Monarch block sizes) |

### Parameter Breakdown

| Component | Params | % |
|---|---|---|
| Token embedding (tied) | 512K | 12.6% |
| CausalConv (x6) | 6.1K | 0.2% |
| Monarch heads (x6, 4 heads each) | 197K | 4.8% |
| LongConv (x6) | 393K | 9.7% |
| OrganelleGate (x6) | 4.6K | 0.1% |
| SwiGLU FFN (x6) | 2.95M | 72.5% |
| RMSNorm (x13) | 3.3K | <0.1% |
| **Total** | **~4.1M** | |
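
The breakdown can be re-derived from the hyperparameters in Model Details. The sketch below assumes no bias terms, three `embed_dim x hidden` matrices per SwiGLU FFN, a Monarch block size of sqrt(T) = 16, and 13 RMSNorm layers counted as two per block plus one final norm; these assumptions are inferred because they reproduce the table, not taken from the model code.

```julia
vocab, d, layers, T = 2_000, 256, 6, 256
K, heads, ffn_hidden, organelles = 4, 4, 640, 3
block = isqrt(T)                                   # assumed Monarch block size

counts = [
    "Token embedding (tied)" => vocab * d,
    "CausalConv"             => layers * K * d,
    "Monarch heads"          => layers * heads * 2 * block^3,
    "LongConv"               => layers * T * d,
    "OrganelleGate"          => layers * organelles * d,
    "SwiGLU FFN"             => layers * 3 * d * ffn_hidden,
    "RMSNorm"                => (2 * layers + 1) * d,
]

total = sum(last, counts)                          # 4_065_024

for (name, n) in counts
    println(rpad(name, 24), lpad(n, 10), lpad(round(100n / total; digits=1), 7), "%")
end
println(rpad("Total", 24), lpad(total, 10))
```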

### Sequence Mixing Efficiency

| | Transformer | Monarch | Symbiogenesis |
|---|---|---|---|
| Seq mixer params/block | 262K | 67K | 100K |
| Reduction vs Transformer | - | 74% | **62%** |
| Position encoding | RoPE (separate) | None | None |
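
The Transformer and Symbiogenesis columns follow from a short calculation. It assumes the attention baseline has four bias-free `d x d` projections (query, key, value, output) and that each Monarch head uses two factors of sixteen 16x16 blocks; both are assumptions chosen because they match the figures in the table.

```julia
d, T, K, heads = 256, 256, 4, 4

attention = 4 * d^2                    # Q, K, V, and output projections
monarch_head = 2 * isqrt(T)^3          # two block-diagonal factors per head
symbio = K * d +                       # CausalConv
         heads * monarch_head +        # Monarch heads
         T * d +                       # LongConv
         3 * d                         # OrganelleGate

reduction = 1 - symbio / attention
println("attention = ", attention, ", symbiogenesis = ", symbio,
        ", reduction = ", round(100 * reduction; digits=1), "%")
```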

## Training

| | Value |
|---|---|
| Dataset | [philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) |
| Corpus | 981 classical texts (Aristotle, Plato, Euclid, Descartes, Kant, Nietzsche, ...) |
| Train tokens | ~100M (about 24 tok/param at ~4.1M parameters; the Chinchilla heuristic is ~20 tok/param) |
| Optimizer | AdamW (lr=1e-3, min_lr=1e-4, cosine decay) |
| Batch size | 32 |
| Hardware | NVIDIA RTX 3060 12GB |
| Throughput | ~19K tok/s (Float32) |
| Framework | Julia + Lux.jl + Zygote.jl + CUDA.jl |
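
The learning-rate settings above correspond to a cosine schedule of roughly the following shape. This is a sketch only: `cosine_lr` is an illustrative name, no warmup is modeled because none is mentioned, and the step count is an estimate that assumes every batch holds 32 full 256-token sequences.

```julia
# Cosine decay from lr down to min_lr over total_steps.
function cosine_lr(step; total_steps, lr=1e-3, min_lr=1e-4)
    progress = clamp(step / total_steps, 0, 1)
    return min_lr + 0.5 * (lr - min_lr) * (1 + cos(pi * progress))
end

tokens_per_step = 32 * 256                           # batch size x context length
total_steps = cld(100_000_000, tokens_per_step)      # estimated, not from the source

for step in (0, total_steps ÷ 2, total_steps)
    println("step ", step, ": lr = ", cosine_lr(step; total_steps))
end
```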

### Training Progress (partial)

| Step | Train Loss | Val Loss | Val PPL | Gate Entropy |
|---|---|---|---|---|
| 1 | 17.10 | 17.03 | 24.9M | 1.099 |
| 500 | 6.50 | 4.92 | 137.5 | 1.098 |
| 1,000 | 4.43 | 4.38 | 79.9 | 1.094 |
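
The Val PPL column is consistent with perplexity computed as the exponential of the validation loss, assuming the loss is mean cross-entropy in nats. A quick check against the rounded losses in the table:

```julia
perplexity(loss) = exp(loss)

val_losses = [17.03, 4.92, 4.38]       # steps 1, 500, 1,000
ppl = perplexity.(val_losses)

for (loss, p) in zip(val_losses, ppl)
    println("loss = ", loss, "  ->  PPL = ", round(p; sigdigits=3))
end
```

Small differences from the table are expected because the losses shown there are rounded to two decimals.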

### Gelation Monitoring

Training includes phase transition detection inspired by polymer physics:

- **CUSUM on loss curvature**: Detects sudden changes in 2nd derivative of loss curve
- **Gate entropy**: Tracks organelle specialization (ln 3 ≈ 1.099 = uniform over the three organelles, 0 = fully specialized)
- **Kuramoto order parameter**: Measures synchronization of block dynamics (R > 0.9 = gelation)
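
The three monitors can be sketched as small standalone functions. Everything here is an illustrative assumption rather than the training code: the function names, averaging the entropy over channels, the `drift` and `threshold` values, and the idea that each block is summarized by a single phase angle (the card does not say how phases are derived).

```julia
# Mean per-channel entropy of the gate's softmax weights; logits is (D, 3).
function gate_entropy(logits::AbstractMatrix)
    e = exp.(logits .- maximum(logits; dims=2))
    p = e ./ sum(e; dims=2)
    h = -sum(p .* log.(p .+ eps(eltype(p))); dims=2)
    return sum(h) / length(h)
end

# Kuramoto order parameter R = |mean(exp(i * θ))|; R = 1 when all phases coincide.
kuramoto_R(θ::AbstractVector) = abs(sum(cis, θ) / length(θ))

# CUSUM-style accumulator on the magnitude of the loss curve's second difference.
# Returns the index of the loss value at the centre of the triggering stencil.
function cusum_curvature(loss::AbstractVector; drift=0.0, threshold=1.0)
    curv = loss[3:end] .- 2 .* loss[2:end-1] .+ loss[1:end-2]
    s = 0.0
    for (i, c) in enumerate(curv)
        s = max(0.0, s + abs(c) - drift)
        s > threshold && return i + 1
    end
    return nothing
end

uniform_H = gate_entropy(zeros(256, 3))                  # ≈ log(3)
sync_R = kuramoto_R(fill(0.3, 6))                        # six blocks in phase
change_at = cusum_curvature([10.0, 9, 8, 7, 3, 2, 1])    # sudden drop after index 4
```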

## Comparison with Other Julia SLM Variants

| | [JuliaSLM](https://huggingface.co/LisaMegaWatts/JuliaSLM) | [MonarchSLM](https://huggingface.co/LisaMegaWatts/MonarchSLM) | **SymbioSLM** |
|---|---|---|---|
| Architecture | Transformer | Monarch Mixer | Symbiogenesis |
| Sequence mixing | 4-head attention | 8-head Monarch + conv | 3 organelles + gate |
| Parameters | 5.04M | 4.98M | ~4.1M |
| Layers | 6 | 8 | 6 |
| Val PPL | **34.5** | 38.4 | TBD (79.9 at step 1,000) |
| Throughput | 26K tok/s | 19K tok/s | 19K tok/s |
| Position encoding | RoPE | None | None |

## Usage

### Generate with Julia

```julia
# Activate the project environment and load the model code
using Pkg; Pkg.activate("julia-slm")
include("src/JuliaGPT.jl")
using .JuliaGPT
using .JuliaGPT: Lux, CUDA

# Load the BPE tokenizer, then the trained parameters and state onto the selected device
tok = BPETokenizer("vocab.json", "merges.txt")
device = Lux.gpu_device()
ps, st, _, step, val_loss = load_checkpoint("final.jld2"; device)

# Rebuild the architecture with the hyperparameters listed under Model Details
model = create_model(ModelConfig(;
    arch="symbiogenesis", vocab_size=vocab_size(tok),
    embed_dim=256, n_layers=6, n_heads=4, head_dim=64,
    n_monarch_heads=4, conv_kernel_size=4,
    ffn_mult=4, context_length=256, weight_tying=true,
))

# Sample up to 200 new tokens that continue the prompt
text = generate(model, ps, st, tok, "the nature of ";
    max_new_tokens=200, temperature=0.8, top_k=40)
println(text)
```

### OpenAI-Compatible API

The model is served via [SymbioSLM Space](https://huggingface.co/spaces/LisaMegaWatts/SymbioSLM):

```bash
curl -X POST https://lisamegawatts-symbioslm.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "the nature of"}],
    "max_tokens": 200,
    "temperature": 0.8,
    "top_k": 40
  }'
```

Streaming is supported by adding `"stream": true` to the request body.

## Files

| File | Description |
|---|---|
| `final.jld2` | Trained model parameters (JLD2 format) |
| `config.toml` | Model architecture configuration |
| `vocab.json` | BPE vocabulary (2000 tokens) |
| `merges.txt` | BPE merge rules |

## Biological Inspiration

The architecture is named after Lynn Margulis' theory of **symbiogenesis** (1967): the proposal that eukaryotic cells originated through the endosymbiotic fusion of distinct prokaryotic organisms. Mitochondria and chloroplasts retain their own DNA, demonstrating their origin as once-independent organisms that became specialized organelles within a larger cell.

Similarly, each SymbioBlock contains three "organelles" with different mathematical properties (local convolution, global structured mixing, global dense filtering) that are fused into a single functional unit through the learned OrganelleGate. The gate entropy tracks how strongly the network differentiates between organelles — analogous to the degree of specialization achieved through evolutionary integration.

## Citation

```bibtex
@misc{symbioslm2026,
  title={Symbiogenesis: Multi-Organelle Sequence Mixing for Small Language Models},
  author={LisaMegaWatts},
  year={2026},
  url={https://huggingface.co/LisaMegaWatts/SymbioSLM}
}
```

## References

- Margulis, L. (published as Sagan, L.) (1967). On the origin of mitosing cells. *Journal of Theoretical Biology*, 14(3), 225-274.
- Fu, D. Y., et al. (2023). Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture. *NeurIPS 2023*.
- Poli, M., et al. (2023). Hyena Hierarchy: Towards Larger Convolutional Language Models. *ICML 2023*.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. *arXiv:2312.00752*.

## License

MIT