BCJR-QAT-Llama-3.2-1B-2bit
Trained 2-bit quantized weight snapshots for Llama-3.2-1B at single decoder layers, produced by BCJR-QAT, a differentiable relaxation of trellis-coded weight quantization. Companion artifacts to the paper:
BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization (V. Iyengar, 2026; arXiv preprint, pending)
Code: github.com/Venugopalan2610/quant-2bit
Headline results (WikiText-2 PPL)
| Configuration | PPL | Δ vs QTIP-PTQ |
|---|---|---|
| FP16 baseline (no quantization) | 9.7000 | n/a |
| QTIP-PTQ at L4 only (2 bpw) | 10.2189 | n/a |
| BCJR-QAT at L4 only, skip-high-T (2 bpw) | 10.1347 | −0.0842 |
| QTIP-PTQ at L8 only (2 bpw) | 10.3083 | n/a |
| BCJR-QAT at L8 only, naive schedule (2 bpw) | 10.3302 | +0.022 (overshoot) |
| QTIP-PTQ at $[L_4, L_8]$ joint | 10.9134 | n/a |
| BCJR-QAT at $[L_4, L_8]$ joint (mixed schedules) | 10.8364 | −0.0770 (super-additive) |
All other decoder layers stay at FP16: 15 of the model's 16 layers in the single-layer configurations, 14 in the two-layer configuration. The 2 bpw rate refers to the named layer(s) only.
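The Δ column can be re-derived from the PPL column. The sketch below recomputes each difference from the values in the table and compares the joint gain with the sum of the two single-layer deltas, which is one plausible reading of the "super-additive" label. That reading, and all variable names, are assumptions for illustration and are not part of the repository.

```python
# PPL values copied from the headline results table.
ppl = {
    "ptq_l4": 10.2189, "qat_l4": 10.1347,
    "ptq_l8": 10.3083, "qat_l8": 10.3302,
    "ptq_joint": 10.9134, "qat_joint": 10.8364,
}

def delta(qat_key, ptq_key):
    """BCJR-QAT PPL minus QTIP-PTQ PPL (negative means QAT is better)."""
    return round(ppl[qat_key] - ppl[ptq_key], 4)

d_l4 = delta("qat_l4", "ptq_l4")
d_l8 = delta("qat_l8", "ptq_l8")
d_joint = delta("qat_joint", "ptq_joint")

# Assumed reading of "super-additive": the joint improvement is larger
# than the sum of the two single-layer deltas.
sum_single = round(d_l4 + d_l8, 4)
is_super_additive = d_joint < sum_single

print(f"L4: {d_l4:+.4f}  L8: {d_l8:+.4f}  joint: {d_joint:+.4f}")
print(f"sum of single-layer deltas: {sum_single:+.4f}")
print(f"joint gain exceeds the sum: {is_super_additive}")
```

A negative Δ means BCJR-QAT reaches lower perplexity than the QTIP-PTQ baseline in the same configuration.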
What's in this repo
| Path | Size | Contents |
|---|---|---|
| `layer_04_skipT_wq.pt` | 232 MB | Hardened-Viterbi snapshot for L4 under T_init=0.3 schedule. The headline winner. |
| `layer_04_skipT_trajectory/` | 1.2 GB | 5 per-step W_latent checkpoints (steps 2/4/6/8/10) for trajectory analysis. |
| `layer_04_naive_wq.pt` | 232 MB | L4 hardened snapshot under T_init=1.0 (the schedule-overshoot example). |
| `layer_08_naive_wq.pt` | 232 MB | L8 hardened snapshot under T_init=1.0, used in multi-layer compounding test. |
| `bench/30step/` | 1.6 GB | LR=2e-5 30-step reference run (sub-threshold drift; no codeword movement). |
| `results/*.json` | ~10 KB | Trajectory-eval and multi-layer-eval JSONs reproducing the paper's tables. |
| `bootstrap/perwin_bcjr_n4.npz` | 3 KB | OLMoE per-window NLLs for bootstrap analysis (companion to OLMoE results in the paper). |
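To illustrate what a bootstrap analysis over per-window NLLs can look like, here is a minimal sketch of a paired percentile bootstrap on the perplexity difference between two methods. It runs on synthetic data: the array layout of `perwin_bcjr_n4.npz` is not documented here, so the inputs, the function names, and the assumption that each window contributes an equally weighted mean NLL are all hypothetical.

```python
import math
import random

def ppl_from_nll(nlls):
    """Perplexity as exp of the mean per-window NLL (equal window weights assumed)."""
    return math.exp(sum(nlls) / len(nlls))

def paired_bootstrap_delta(nll_a, nll_b, n_resamples=2000, seed=0):
    """95% percentile interval for PPL(a) - PPL(b), resampling windows in pairs."""
    assert len(nll_a) == len(nll_b), "paired bootstrap needs matched windows"
    rng = random.Random(seed)
    n = len(nll_a)
    deltas = []
    for _ in range(n_resamples):
        # Draw the same window indices for both methods to keep the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        ppl_a = ppl_from_nll([nll_a[i] for i in idx])
        ppl_b = ppl_from_nll([nll_b[i] for i in idx])
        deltas.append(ppl_a - ppl_b)
    deltas.sort()
    lo = deltas[int(0.025 * n_resamples)]
    hi = deltas[int(0.975 * n_resamples) - 1]
    return lo, hi

# Synthetic stand-in data (NOT the values stored in the repository).
data_rng = random.Random(1)
nll_ptq = [2.3 + data_rng.gauss(0.0, 0.1) for _ in range(64)]
nll_qat = [x - 0.01 for x in nll_ptq]  # uniformly slightly better, by construction

ci_lo, ci_hi = paired_bootstrap_delta(nll_qat, nll_ptq)
print(f"95% CI for PPL(qat) - PPL(ptq): [{ci_lo:.4f}, {ci_hi:.4f}]")
```

An interval that lies entirely below zero indicates that the improvement holds across resampled windows rather than being driven by a few of them.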
How to load and evaluate
The snapshots are PyTorch `.pt` files, each saved as a dict `{"attn_q_proj": tensor, "attn_k_proj": tensor, ..., "mlp_down_proj": tensor}` with one entry per quantized linear in the wrapped layer (7 entries: q/k/v/o/gate/up/down). They install in place into a fresh FP16 Llama-3.2-1B as follows:
```python
import torch
from transformers import AutoModelForCausalLM

# Helper from the companion repo; run from the repo root so that `src` is importable.
from src.qat.eval_llama_layer import install_layer_weights

# Load the unquantized FP16 base model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.float16, device_map="cuda")

# Load the snapshot dict and install it into decoder layer 4.
snap = torch.load("layer_04_skipT_wq.pt", weights_only=True)
install_layer_weights(model, layer_idx=4, snap_or_fn=snap, dtype=torch.float16)
# now model has 2-bit BCJR-QAT-trained weights at layer 4, FP16 elsewhere
```
The `install_layer_weights` helper is in `src/qat/eval_llama_layer.py` in the companion repo.
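Before installing a snapshot, it can help to check that it has the shape described above. The following is a minimal sketch, assuming only that the snapshot is a dict mapping names to objects with a `.shape` attribute; `summarize_snapshot` and `_FakeTensor` are hypothetical names introduced for illustration and do not exist in the companion repo.

```python
from math import prod

def summarize_snapshot(snap):
    """Return ([(name, shape, n_elements), ...], total_elements) for a snapshot dict."""
    rows = []
    total = 0
    for name, tensor in snap.items():
        shape = tuple(tensor.shape)
        n_elements = prod(shape)
        rows.append((name, shape, n_elements))
        total += n_elements
    return rows, total

class _FakeTensor:
    """Stand-in with a .shape attribute so the demo runs without PyTorch."""
    def __init__(self, *shape):
        self.shape = shape

# Toy stand-in; a real snapshot would come from
#   snap = torch.load("layer_04_skipT_wq.pt", weights_only=True)
demo = {"attn_q_proj": _FakeTensor(4, 4), "mlp_down_proj": _FakeTensor(4, 16)}
rows, total = summarize_snapshot(demo)
for name, shape, n_elements in rows:
    print(f"{name:>14}  shape={shape}  elements={n_elements}")
print("total elements:", total)
```

On a real snapshot, the description above implies `len(rows) == 7`, one row per quantized linear.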
Reproduce the headline result from scratch
Producing the headline result took about 5 hours on a single H100 SXM:
```bash
git clone https://github.com/Venugopalan2610/quant-olmoe
cd quant-olmoe
bash scripts/vast_setup_llama.sh
bash scripts/vast_train_llama_skip_highT.sh
python -m scripts.eval_llama_trajectory \
    --ckpt-dir cache/llama_bcjr_skipT \
    --target-layer 4 \
    --output results/llama_skipT_trajectory.json \
    --skip-baselines --ppl-fp-cached 9.70 --ppl-ptq-cached 10.2189
```
Recipe: PTQ-init, $\eta = 2 \times 10^{-4}$, $N = 10$ steps, $T$ schedule $0.3 \to 0.05$, BCJR chunk 16, sequence length 1024, batch size 1.
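The recipe gives only the endpoints of the temperature schedule ($0.3 \to 0.05$ over $N = 10$ steps), not its shape. As a sketch, assuming geometric (log-linear) decay between the endpoints, the per-step temperatures could be generated as follows; the training script may use a different interpolation, and `temperature_schedule` is a hypothetical helper name.

```python
def temperature_schedule(t_init=0.3, t_final=0.05, n_steps=10):
    """Geometric decay from t_init to t_final over n_steps (assumed shape)."""
    if n_steps < 2:
        return [t_init]
    # Constant multiplicative factor so the last step lands exactly on t_final.
    ratio = (t_final / t_init) ** (1.0 / (n_steps - 1))
    return [t_init * ratio ** k for k in range(n_steps)]

schedule = temperature_schedule()
for step, temperature in enumerate(schedule, start=1):
    print(f"step {step:2d}: T = {temperature:.4f}")
```

Geometric decay spends proportionally more steps at low temperature than a linear ramp would, which is a common choice when the final steps are followed by a hard decision.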
Citation
(Publication pending)
```bibtex
@article{iyengar2026bcjrqat,
  title   = {BCJR-QAT: A Differentiable Relaxation of Trellis-Coded
             Weight Quantization},
  author  = {Iyengar, Venugopalan},
  year    = {2026},
  journal = {arXiv preprint}
}
```
Please also cite the upstream Llama-3.2 base model (Llama 3.2 Community License, Meta) and the QTIP paper that this work builds on:
```bibtex
@inproceedings{tseng2024qtip,
  title     = {QTIP: Quantization with Trellises and Incoherence Processing},
  author    = {Tseng, Albert and Sun, Qingyao and Hou, David and
               De Sa, Christopher},
  booktitle = {NeurIPS},
  year      = {2024}
}
```
License
Apache-2.0 for the trained weights and quantization metadata in this repository. Use of the underlying Llama-3.2-1B base model weights is governed by the Llama 3.2 Community License.