---
language:
- en
- nl
- zh
tags:
- babylm-2026
- multilingual
- typology
- taam
- pstar
license: apache-2.0
---

# TAAM (Typology-Aware Adaptive Mixing): `Pstar`, seed `1337`

This is a BabyLM 2026 Multilingual Track submission checkpoint. It was trained
on English, Dutch, and Mandarin Chinese under a ≤100M-unique-token budget with
the [TAAM](https://github.com/Amos-Luna/Asymmetric-Multilingual-Acquisition_TAAM)
method.

- **Method**: `Pstar`
- **Seed**: `1337`
- **Repo**: `amosluna/babylm-2026-pstar-seed1337`
- **Final π** (per-language sampling probability): `eng=0.159, nld=0.258, zho=0.583`
- **Total token exposures**: `655360000` (655.36M)
- **Training wall-clock**: `10117.4 s` (about 2 h 49 min)
- **Source run dir**: `runs/2026-06-01_Pstar_seed1337`

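
The run metadata above implies a few derived quantities that are handy when comparing seeds. The snippet below is a back-of-the-envelope sketch, not output from the training code: the constant names are made up for illustration, and the "passes" figure assumes the full 100M-unique-token budget was used (the card only states an upper bound, so the true number of passes could be higher).

```python
# Values copied from the run metadata in this card.
TOTAL_EXPOSURES = 655_360_000        # total token exposures
WALL_CLOCK_S = 10117.415275096893    # training wall-clock, in seconds
UNIQUE_TOKEN_BUDGET = 100_000_000    # upper bound on unique tokens, per the card
FINAL_PI = {"eng": 0.159, "nld": 0.258, "zho": 0.583}

# Average throughput over the whole run.
tokens_per_second = TOTAL_EXPOSURES / WALL_CLOCK_S

# Lower bound on passes over the data: the budget is "at most 100M unique
# tokens", so a smaller unique-token count would mean more passes.
min_passes = TOTAL_EXPOSURES / UNIQUE_TOKEN_BUDGET

# The reported final sampling probabilities should form a distribution.
pi_total = sum(FINAL_PI.values())

print(f"throughput: about {tokens_per_second:,.0f} tokens/s")
print(f"passes over the budget: at least {min_passes:.2f}")
print(f"sum of final pi: {pi_total:.3f}")
```

Note that the final π is only the last value of an adaptive schedule, so multiplying it by the total exposures would not give the per-language exposure counts.
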
## Intermediate checkpoints

This repo exposes `24` intermediate checkpoints as branches following
the BabyLM 2026 naming convention: `chck_1M, chck_2M, ..., chck_10M,
chck_20M, ..., chck_100M, chck_200M, ..., chck_600M`. The eval pipeline at
[babylm-org/babylm-eval](https://github.com/babylm-org/babylm-eval) pulls
these revisions automatically with:

```bash
bash multilingual/scripts/zeroshot_model_fast_all.sh \
  --model_name amosluna/babylm-2026-pstar-seed1337
```

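
For scripts that need the branch names explicitly, the schedule described above can be enumerated as follows. This is a sketch under one assumption: the elided names follow uniform steps of 1M, 10M, and 100M within each range, which is consistent with the stated total of 24. The helper name `checkpoint_branches` is hypothetical and not part of the repo or the eval pipeline.

```python
def checkpoint_branches() -> list[str]:
    """Return the branch names in training order (assumed uniform steps)."""
    millions = (
        list(range(1, 11))            # 1M .. 10M, every 1M
        + list(range(20, 101, 10))    # 20M .. 100M, every 10M
        + list(range(200, 601, 100))  # 200M .. 600M, every 100M
    )
    return [f"chck_{m}M" for m in millions]


branches = checkpoint_branches()
print(len(branches))              # number of intermediate checkpoints
print(branches[0], branches[-1])  # first and last branch name
```

Each name can be passed as the `revision` argument of `from_pretrained` to load that checkpoint.
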
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("amosluna/babylm-2026-pstar-seed1337")
model = AutoModelForCausalLM.from_pretrained("amosluna/babylm-2026-pstar-seed1337")  # final checkpoint
# Intermediate checkpoint:
# model = AutoModelForCausalLM.from_pretrained("amosluna/babylm-2026-pstar-seed1337", revision="chck_100M")
```

## Method summary

TAAM combines (a) a URIEL/lang2vec-derived typological prior over initial
sampling probabilities, (b) EXP3 online updates over per-language sampling
probabilities, and (c) byte-premium-aware token budgeting. The two reward
variants are `normalized_excess_loss` (v1, delta-based) and
`cross_lingual_deficit` (v2, level-based).

See the paper and the public repo for full details, including the structural
floor derivation that explains why the v1 reward starves the hardest
language under typological asymmetry.

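
To make component (b) concrete, here is a minimal sketch of textbook EXP3 applied to per-language sampling. It is not the TAAM implementation. The class name `Exp3Sampler`, the prior values, and the toy reward table are all hypothetical, and the sketch assumes the prior enters only through the initial weights. The real `normalized_excess_loss` and `cross_lingual_deficit` rewards are defined in the paper and repo and are not reproduced here.

```python
import math
import random


class Exp3Sampler:
    """Textbook EXP3 over K arms (languages), with prior-initialized weights."""

    def __init__(self, prior: dict[str, float], gamma: float = 0.1, seed: int = 0):
        self.arms = list(prior)
        self.gamma = gamma
        self.rng = random.Random(seed)
        # Assumption: the prior only sets the initial (log-)weights.
        self.log_w = {a: math.log(prior[a]) for a in self.arms}

    def probs(self) -> dict[str, float]:
        # Normalized weights mixed with uniform exploration (rate gamma).
        m = max(self.log_w.values())  # subtract the max for numerical stability
        w = {a: math.exp(lw - m) for a, lw in self.log_w.items()}
        total = sum(w.values())
        k = len(self.arms)
        return {a: (1 - self.gamma) * w[a] / total + self.gamma / k for a in self.arms}

    def sample(self) -> str:
        p = self.probs()
        return self.rng.choices(self.arms, weights=[p[a] for a in self.arms])[0]

    def update(self, arm: str, reward: float) -> None:
        # Importance-weighted reward estimate; reward is expected in [0, 1].
        p = self.probs()[arm]
        self.log_w[arm] += self.gamma * (reward / p) / len(self.arms)


# Hypothetical prior and fixed toy rewards, for illustration only.
sampler = Exp3Sampler({"eng": 0.4, "nld": 0.35, "zho": 0.25}, gamma=0.1, seed=1337)
toy_reward = {"eng": 0.2, "nld": 0.4, "zho": 0.8}
for _ in range(2000):
    lang = sampler.sample()
    sampler.update(lang, toy_reward[lang])

pi = sampler.probs()
print({k: round(v, 3) for k, v in pi.items()})
```

In this sketch the exploration term keeps every language's probability at or above `gamma / K`, so no arm is ever dropped entirely. Whether and how that relates to the structural floor discussed in the paper is not something this sketch can show.
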
## Citation

```bibtex
@inproceedings{taam2026,
  title     = {Typology-Aware Adaptive Mixing for Multilingual BabyLMs},
  author    = {Luna, Amos and others},
  year      = {2026},
  booktitle = {Proceedings of the BabyLM Workshop at EMNLP 2026}
}
```