---
license: apache-2.0
tags:
  - quantization
  - quantization-aware-training
  - bcjr
  - trellis-coded-quantization
  - llama
  - 2-bit
base_model: meta-llama/Llama-3.2-1B
language:
  - en
library_name: transformers
---

# BCJR-QAT-Llama-3.2-1B-2bit

Trained 2-bit quantized weight snapshots for individual decoder layers of
**Llama-3.2-1B**, produced by **BCJR-QAT**, a differentiable relaxation of
trellis-coded weight quantization. These are companion artifacts to the paper:

> *BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization* (V. Iyengar, 2026; arXiv preprint, pending)

Code:
[github.com/Venugopalan2610/quant-2bit](https://github.com/Venugopalan2610/quant-2bit)

## Headline results (WikiText-2 PPL)

| Configuration | PPL | Δ vs QTIP-PTQ |
|---|---|---|
| FP16 baseline (no quantization) | 9.7000 | — |
| QTIP-PTQ at L4 only (2 bpw) | 10.2189 | — |
| **BCJR-QAT at L4 only, skip-high-T** (2 bpw) | **10.1347** | **−0.0842** |
| QTIP-PTQ at L8 only (2 bpw) | 10.3083 | — |
| BCJR-QAT at L8 only, naive schedule (2 bpw) | 10.3302 | +0.0219 (overshoot) |
| QTIP-PTQ at $[L_4, L_8]$ joint | 10.9134 | — |
| **BCJR-QAT at $[L_4, L_8]$ joint** (mixed schedules) | **10.8364** | **−0.0770** (super-additive) |

Llama-3.2-1B has 16 decoder layers; every layer not named in a configuration
stays at FP16 (15 layers in the single-layer rows, 14 in the joint rows).
The 2 bpw rate refers to the named layer(s) only.
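The Δ column can be recomputed directly from the PPL values in the table. The short check below does that and also sums the two single-layer deltas for comparison with the joint delta. The variable names are illustrative, and reading "super-additive" as "the joint gain exceeds the sum of the single-layer deltas" is an assumption about the intended meaning.

```python
# PPL values copied from the table above.
ptq = {"L4": 10.2189, "L8": 10.3083, "L4+L8": 10.9134}
qat = {"L4": 10.1347, "L8": 10.3302, "L4+L8": 10.8364}

# Delta vs QTIP-PTQ per configuration (negative means BCJR-QAT is better).
delta = {name: round(qat[name] - ptq[name], 4) for name in ptq}

# Sum of the two single-layer deltas, to compare against the joint delta.
additive = round(delta["L4"] + delta["L8"], 4)

for name, value in delta.items():
    print(f"{name}: {value:+.4f}")
print(f"sum of single-layer deltas: {additive:+.4f}")
```

Under that reading, the joint configuration improves on QTIP-PTQ by more than the two single-layer deltas added together.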

## What's in this repo

| Path | Size | Contents |
|---|---|---|
| `layer_04_skipT_wq.pt` | 232 MB | Hardened-Viterbi snapshot for **L4** under `T_init=0.3` schedule. **The headline winner.** |
| `layer_04_skipT_trajectory/` | 1.2 GB | 5 per-step `W_latent` checkpoints (steps 2/4/6/8/10) for trajectory analysis. |
| `layer_04_naive_wq.pt` | 232 MB | L4 hardened snapshot under `T_init=1.0` (the schedule-overshoot example). |
| `layer_08_naive_wq.pt` | 232 MB | L8 hardened snapshot under `T_init=1.0`, used in multi-layer compounding test. |
| `bench/30step/` | 1.6 GB | LR=2e-5 30-step reference run (sub-threshold drift; no codeword movement). |
| `results/*.json` | ~10 KB | Trajectory-eval and multi-layer-eval JSONs reproducing the paper's tables. |
| `bootstrap/perwin_bcjr_n4.npz` | 3 KB | OLMoE per-window NLLs for bootstrap analysis (companion to OLMoE results in the paper). |

## How to load and evaluate

Each snapshot is a PyTorch `.pt` file holding a dict
`{"attn_q_proj": tensor, "attn_k_proj": tensor, ..., "mlp_down_proj":
tensor}`, with one entry per quantized linear module in the wrapped decoder
layer (7 entries: q/k/v/o/gate/up/down). To install a snapshot in place into
a fresh FP16 Llama-3.2-1B:

```python
import torch
from transformers import AutoModelForCausalLM
from src.qat.eval_llama_layer import install_layer_weights

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.float16, device_map="cuda")
snap = torch.load("layer_04_skipT_wq.pt", weights_only=True)
install_layer_weights(model, layer_idx=4, snap_or_fn=snap, dtype=torch.float16)
# now model has 2-bit BCJR-QAT-trained weights at layer 4, FP16 elsewhere
```

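The `install_layer_weights` helper lives in
[`src/qat/eval_llama_layer.py`](https://github.com/Venugopalan2610/quant-olmoe/blob/master/src/qat/eval_llama_layer.py)
in the companion repo. Because the snippet imports it as
`src.qat.eval_llama_layer`, run it from the root of a checkout of that repo.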
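To make the snapshot layout concrete, here is a minimal sketch of how the dict keys could map onto Hugging Face Llama parameter names. It is not the repo's `install_layer_weights`; the function name `snapshot_key_to_param_name` is hypothetical, and the key pattern (`attn_*_proj`, `mlp_*_proj`) is assumed from the three keys shown above plus the q/k/v/o/gate/up/down list.

```python
# Hypothetical helper, not part of the companion repo.
# Assumes snapshot keys look like "attn_q_proj" or "mlp_down_proj".
_PREFIX_TO_SUBMODULE = {"attn": "self_attn", "mlp": "mlp"}


def snapshot_key_to_param_name(key: str, layer_idx: int) -> str:
    """Map e.g. 'attn_q_proj' to 'model.layers.4.self_attn.q_proj.weight'."""
    prefix, _, proj = key.partition("_")
    if prefix not in _PREFIX_TO_SUBMODULE or not proj.endswith("_proj"):
        raise KeyError(f"unrecognized snapshot key: {key!r}")
    submodule = _PREFIX_TO_SUBMODULE[prefix]
    return f"model.layers.{layer_idx}.{submodule}.{proj}.weight"


# The seven keys expected in one snapshot (assumed naming).
expected_keys = [
    "attn_q_proj", "attn_k_proj", "attn_v_proj", "attn_o_proj",
    "mlp_gate_proj", "mlp_up_proj", "mlp_down_proj",
]
param_names = [snapshot_key_to_param_name(k, 4) for k in expected_keys]
```

Comparing `param_names` against `model.state_dict().keys()` is a quick way to confirm that a snapshot covers every linear weight in the target layer before installing it.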

## Reproduce the headline result from scratch

The headline result took about 5 hours to produce on a single H100 SXM:

```bash
git clone https://github.com/Venugopalan2610/quant-olmoe
cd quant-olmoe
bash scripts/vast_setup_llama.sh
bash scripts/vast_train_llama_skip_highT.sh
python -m scripts.eval_llama_trajectory \
    --ckpt-dir cache/llama_bcjr_skipT \
    --target-layer 4 \
    --output results/llama_skipT_trajectory.json \
    --skip-baselines --ppl-fp-cached 9.70 --ppl-ptq-cached 10.2189
```

Recipe: PTQ initialization, learning rate $\eta = 2 \times 10^{-4}$,
$N = 10$ steps, temperature $T$ annealed $0.3 \to 0.05$, BCJR chunk 16,
sequence length 1024, batch size 1.
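For intuition about the $T$ schedule, the sketch below generates ten values running from 0.3 down to 0.05. The recipe gives only the endpoints and the step count, so the geometric interpolation used here is an assumption, and `temperature_schedule` is a hypothetical name rather than a function from the companion repo.

```python
def temperature_schedule(t_init=0.3, t_final=0.05, n_steps=10):
    """Return n_steps temperatures from t_init to t_final.

    Geometric (constant-ratio) decay is an assumption; the actual
    training script may interpolate differently.
    """
    if n_steps < 2:
        return [t_init]
    ratio = (t_final / t_init) ** (1.0 / (n_steps - 1))
    return [t_init * ratio ** step for step in range(n_steps)]


schedule = temperature_schedule()
for step, temp in enumerate(schedule, start=1):
    print(f"step {step:2d}: T = {temp:.4f}")
```

Passing `t_init=1.0` gives the corresponding curve for the naive schedule's starting temperature, assuming it shares the same final value.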

## Citation

(Publication pending.)
```bibtex
@article{iyengar2026bcjrqat,
  title   = {BCJR-QAT: A Differentiable Relaxation of Trellis-Coded
             Weight Quantization},
  author  = {Iyengar, Venugopalan},
  year    = {2026},
  journal = {arXiv preprint}
}
```

Please also cite the upstream Llama-3.2 base model
([Llama 3.2 Community License](https://www.llama.com/llama3_2/license/),
Meta) and the QTIP paper that this work builds on:

```bibtex
@inproceedings{tseng2024qtip,
  title     = {QTIP: Quantization with Trellises and Incoherence Processing},
  author    = {Tseng, Albert and Sun, Qingyao and Kuleshov, Volodymyr and
               De Sa, Christopher},
  booktitle = {NeurIPS},
  year      = {2024}
}
```

## License

Apache-2.0 for the trained weights and quantization metadata in this
repository. Use of the underlying Llama-3.2-1B base model weights is
governed by the
[Llama 3.2 Community License](https://www.llama.com/llama3_2/license/).