---
language:
- en
- nl
- zh
tags:
- babylm-2026
- multilingual
- typology
- taam
- pstar
license: apache-2.0
---

# TAAM — Typology-Aware Adaptive Mixing — `Pstar` seed `1337`

This checkpoint is a submission to the BabyLM 2026 Multilingual Track. It was
trained on English, Dutch, and Mandarin Chinese under a budget of at most 100M
unique tokens, using the
[TAAM](https://github.com/Amos-Luna/Asymmetric-Multilingual-Acquisition_TAAM)
method.

- **Method**: `Pstar`
- **Seed**: `1337`
- **Repo**: `amosluna/babylm-2026-pstar-seed1337`
- **Final π** (per-language sampling probability): `eng=0.159, nld=0.258, zho=0.583`
- **Total token exposures**: `655360000` (about 655M, counting repeated passes over the data)
- **Training wall-clock**: `10117.415275096893 s` (about 2 h 49 min)
- **Source run dir**: `runs/2026-06-01_Pstar_seed1337`
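
To make the final π concrete, the sketch below draws a language for each of a number of batches according to those probabilities. It is only an illustration of what a per-language sampling probability means: `FINAL_PI` copies the values listed above, while `sample_languages`, the batch count, and the seed are hypothetical names and choices, not part of the TAAM code.

```python
import random
from collections import Counter

# Final per-language sampling probabilities, copied from the list above.
FINAL_PI = {"eng": 0.159, "nld": 0.258, "zho": 0.583}


def sample_languages(pi, n_batches, seed=0):
    """Draw one language per batch according to pi (hypothetical helper)."""
    rng = random.Random(seed)
    langs = list(pi)
    return rng.choices(langs, weights=[pi[lang] for lang in langs], k=n_batches)


# Count how often each language is drawn over 10,000 illustrative batches.
counts = Counter(sample_languages(FINAL_PI, 10_000))
print(counts)
```

With these probabilities, Mandarin batches are drawn most often and English batches least often.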

## Intermediate checkpoints

This repo exposes `24` intermediate checkpoints as branches following
the BabyLM 2026 naming convention: `chck_1M, chck_2M, ..., chck_10M,
chck_20M, ..., chck_100M, chck_200M, ..., chck_600M`. The eval pipeline at
[babylm-org/babylm-eval](https://github.com/babylm-org/babylm-eval) pulls
these revisions automatically with:

```bash
bash multilingual/scripts/zeroshot_model_fast_all.sh \
    --model_name amosluna/babylm-2026-pstar-seed1337
```
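
The branch names follow a regular pattern, so they can be enumerated programmatically. The sketch below reconstructs the list from the pattern described above and confirms that it yields 24 names; `checkpoint_branches` is a hypothetical helper, not something shipped with this repo or the eval pipeline.

```python
def checkpoint_branches():
    """Enumerate branch names: 1M-10M, then 20M-100M, then 200M-600M."""
    steps_m = (
        list(range(1, 11))             # chck_1M ... chck_10M
        + list(range(20, 101, 10))     # chck_20M ... chck_100M
        + list(range(200, 601, 100))   # chck_200M ... chck_600M
    )
    return [f"chck_{n}M" for n in steps_m]


branches = checkpoint_branches()
print(len(branches), branches[0], branches[-1])
```

Each name can be passed as the `revision` argument of `from_pretrained` to load that checkpoint.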

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("amosluna/babylm-2026-pstar-seed1337")
model = AutoModelForCausalLM.from_pretrained("amosluna/babylm-2026-pstar-seed1337")  # final checkpoint
# Intermediate checkpoint:
# model = AutoModelForCausalLM.from_pretrained("amosluna/babylm-2026-pstar-seed1337", revision="chck_100M")
```

## Method summary

TAAM combines (a) a URIEL/lang2vec-derived typological prior over initial
sampling probabilities, (b) EXP3 online updates over per-language sampling
probabilities, and (c) byte-premium-aware token budgeting. The two reward
variants are `normalized_excess_loss` (v1, delta-based) and
`cross_lingual_deficit` (v2, level-based).
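
For readers unfamiliar with EXP3, the sketch below shows the textbook form of the update over three languages. It is not TAAM's implementation: the uniform starting weights stand in for the typological prior, the reward value is made up, and `exp3_probs`, `exp3_update`, and `gamma` are illustrative names. The actual reward definitions and budgeting logic are in the paper and repo.

```python
import math


def exp3_probs(weights, gamma):
    """Mix normalized weights with uniform exploration (textbook EXP3)."""
    total = sum(weights.values())
    k = len(weights)
    return {lang: (1 - gamma) * w / total + gamma / k for lang, w in weights.items()}


def exp3_update(weights, chosen, reward, gamma):
    """Return new weights after observing a reward in [0, 1] for `chosen`."""
    probs = exp3_probs(weights, gamma)
    estimated = reward / probs[chosen]  # importance-weighted reward estimate
    new_weights = dict(weights)
    new_weights[chosen] = weights[chosen] * math.exp(gamma * estimated / len(weights))
    return new_weights


# Assumption: uniform weights as a placeholder for a typological prior.
weights = {"eng": 1.0, "nld": 1.0, "zho": 1.0}
gamma = 0.1

before = exp3_probs(weights, gamma)
weights = exp3_update(weights, "zho", reward=0.8, gamma=gamma)  # made-up reward
after = exp3_probs(weights, gamma)
print(before["zho"], after["zho"])
```

Rewarding a language raises its sampling probability on the next step, while the `gamma / k` term keeps every language above a minimum exploration floor.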

See the paper and the public repo for full details, including the structural
floor derivation that explains why the v1 reward starves the hardest
language under typological asymmetry.

## Citation

```bibtex
@inproceedings{taam2026,
  title  = {Typology-Aware Adaptive Mixing for Multilingual BabyLMs},
  author = {Luna, Amos and collaborators},
  year   = {2026},
  booktitle = {Proceedings of the BabyLM Workshop at EMNLP 2026}
}
```