---
license: apache-2.0
tags:
- quantization
- quantization-aware-training
- bcjr
- trellis-coded-quantization
- llama
- 2-bit
base_model: meta-llama/Llama-3.2-1B
language:
- en
library_name: transformers
---

# BCJR-QAT-Llama-3.2-1B-2bit

Trained 2-bit quantized weight snapshots for **Llama-3.2-1B** at single decoder layers, produced by **BCJR-QAT**, a differentiable relaxation of trellis-coded weight quantization. These are companion artifacts to the paper

> *BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization* (V. Iyengar, 2026; arXiv preprint, pending)

Code: [github.com/Venugopalan2610/quant-2bit](https://github.com/Venugopalan2610/quant-2bit)

## Headline results (WikiText-2 PPL)

| Configuration | PPL | Δ vs QTIP-PTQ |
|---|---|---|
| FP16 baseline (no quantization) | 9.7000 | n/a |
| QTIP-PTQ at L4 only (2 bpw) | 10.2189 | reference |
| **BCJR-QAT at L4 only, skip-high-T** (2 bpw) | **10.1347** | **−0.0842** |
| QTIP-PTQ at L8 only (2 bpw) | 10.3083 | reference |
| BCJR-QAT at L8 only, naive schedule (2 bpw) | 10.3302 | +0.0219 (overshoot) |
| QTIP-PTQ at $[L_4, L_8]$ joint | 10.9134 | reference |
| **BCJR-QAT at $[L_4, L_8]$ joint** (mixed schedules) | **10.8364** | **−0.0770** (super-additive) |

Each Δ is measured against the QTIP-PTQ row for the same layer configuration. All remaining decoder layers stay at FP16: 15 of the 16 in single-layer configurations and 14 in the joint configuration. The 2 bpw rate refers to the named layer(s) only.

## What's in this repo

| Path | Size | Contents |
|---|---|---|
| `layer_04_skipT_wq.pt` | 232 MB | Hardened-Viterbi snapshot for **L4** under the `T_init=0.3` schedule. **The headline winner.** |
| `layer_04_skipT_trajectory/` | 1.2 GB | Five per-step `W_latent` checkpoints (steps 2, 4, 6, 8, 10) for trajectory analysis. |
| `layer_04_naive_wq.pt` | 232 MB | L4 hardened snapshot under `T_init=1.0` (the schedule-overshoot example). |
| `layer_08_naive_wq.pt` | 232 MB | L8 hardened snapshot under `T_init=1.0`, used in the multi-layer compounding test. |
| `bench/30step/` | 1.6 GB | LR=2e-5 30-step reference run (sub-threshold drift; no codeword movement). |
| `results/*.json` | ~10 KB | Trajectory-eval and multi-layer-eval JSONs reproducing the paper's tables. |
| `bootstrap/perwin_bcjr_n4.npz` | 3 KB | OLMoE per-window NLLs for bootstrap analysis (companion to the OLMoE results in the paper). |

## How to load and evaluate

Each snapshot is a PyTorch `.pt` file holding a dict `{"attn_q_proj": tensor, "attn_k_proj": tensor, ..., "mlp_down_proj": tensor}`, with one entry per quantized linear in the wrapped layer (7 entries: q/k/v/o/gate/up/down). To install a snapshot in place into a fresh FP16 Llama-3.2-1B:

```python
import torch
from transformers import AutoModelForCausalLM

# Helper from the companion repo; run from the repo root so `src` is importable.
from src.qat.eval_llama_layer import install_layer_weights

# Load the unquantized FP16 base model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.float16, device_map="cuda")

# Load the hardened snapshot (a plain dict of tensors) and overwrite layer 4.
snap = torch.load("layer_04_skipT_wq.pt", weights_only=True)
install_layer_weights(model, layer_idx=4, snap_or_fn=snap, dtype=torch.float16)
# The model now has 2-bit BCJR-QAT-trained weights at layer 4, FP16 elsewhere.
```

The `install_layer_weights` helper is in [`src/qat/eval_llama_layer.py`](https://github.com/Venugopalan2610/quant-olmoe/blob/master/src/qat/eval_llama_layer.py) in the companion repo.

## Reproduce the headline result from scratch

The headline run took about 5 hours on a single H100 SXM:

```bash
git clone https://github.com/Venugopalan2610/quant-olmoe
cd quant-olmoe
bash scripts/vast_setup_llama.sh
bash scripts/vast_train_llama_skip_highT.sh
python -m scripts.eval_llama_trajectory \
    --ckpt-dir cache/llama_bcjr_skipT \
    --target-layer 4 \
    --output results/llama_skipT_trajectory.json \
    --skip-baselines --ppl-fp-cached 9.70 --ppl-ptq-cached 10.2189
```

Recipe: PTQ-init, learning rate $\eta = 2 \times 10^{-4}$, $N = 10$ steps, $T$ schedule $0.3 \to 0.05$, BCJR chunk 16, sequence length 1024, batch size 1.
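The recipe gives only the endpoints of the temperature schedule ($0.3 \to 0.05$ over $N = 10$ steps), not its shape. As a rough illustration of what such an annealing schedule looks like, the sketch below assumes a geometric decay between those endpoints. The function name `temperature_schedule` and the geometric form are assumptions made here, not taken from the training scripts, which remain the authoritative source.

```python
from typing import List


def temperature_schedule(t_init: float, t_final: float, num_steps: int) -> List[float]:
    """Return num_steps temperatures decaying geometrically from t_init to t_final.

    Hypothetical helper: the geometric shape is an assumption for illustration.
    """
    if num_steps < 2:
        raise ValueError("need at least two steps to interpolate")
    if t_init <= 0 or t_final <= 0:
        raise ValueError("temperatures must be positive")
    # Constant per-step ratio so that step 0 is t_init and the last step is t_final.
    ratio = (t_final / t_init) ** (1.0 / (num_steps - 1))
    return [t_init * ratio ** k for k in range(num_steps)]


# Endpoints and step count from the recipe above.
temps = temperature_schedule(0.3, 0.05, 10)
for step, temp in enumerate(temps, start=1):
    print(f"step {step:2d}: T = {temp:.4f}")
```

A lower `t_init` skips the high-temperature part of the anneal, which is the difference between the skip-high-T (`T_init=0.3`) and naive (`T_init=1.0`) snapshots listed in this repo.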
## Citation (publication pending)

```bibtex
@article{iyengar2026bcjrqat,
  title   = {BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization},
  author  = {Iyengar, Venugopalan},
  year    = {2026},
  journal = {arXiv preprint}
}
```

Please also cite the upstream Llama-3.2 base model ([Llama 3.2 Community License](https://www.llama.com/llama3_2/license/), Meta) and the QTIP paper that this work builds on:

```bibtex
@inproceedings{tseng2024qtip,
  title     = {QTIP: Quantization with Trellises and Incoherence Processing},
  author    = {Tseng, Albert and Sun, Qingyao and Hou, David and De Sa, Christopher},
  booktitle = {NeurIPS},
  year      = {2024}
}
```

## License

Apache-2.0 for the trained weights and quantization metadata in this repository. Use of the underlying Llama-3.2-1B base model weights is governed by the [Llama 3.2 Community License](https://www.llama.com/llama3_2/license/).