---
language:
- en
license: apache-2.0
library_name: gguf
base_model: HuggingFaceTB/SmolLM2-1.7B
pipeline_tag: text-generation
tags:
- gguf
- llama-cpp
- quantization
- qat
- quantization-aware-training
- scheduled-qat
- smollm2
- edge-deployment
- int4
- int8
model_name: SmolLM2-1.7B Scheduled QAT Linear GGUF
datasets:
- wikitext
quantized_by: jpcurada
---

# SmolLM2-1.7B - Scheduled QAT (Linear Schedule) - GGUF

GGUF-quantized versions of [SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B), trained with **Scheduled Quantization-Aware Training** (linear bit-width reduction schedule) before quantization.

> **Key insight:** Unlike weights produced by naive Post-Training Quantization (PTQ), these weights were trained specifically to tolerate quantization. During training, the simulated precision was gradually reduced from FP32 → FP16 → INT8 → INT4 following a linear schedule, so the model could adapt its weights to the quantization noise at each stage.

## Files

| Filename | Quant | Size | BPW | Description |
|----------|-------|------|-----|-------------|
| `smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf` | Q4_K_M | ~1.0 GB | 4.93 | **Recommended**: best quality/size ratio for edge deployment |
| `smollm2-1.7b-sched-qat-linear-Q8_0.gguf` | Q8_0 | ~1.7 GB | 8.50 | Higher quality, larger size |

BPW is the average number of bits per weight.

## Training Details

| Parameter | Value |
|-----------|-------|
| **Base model** | [HuggingFaceTB/SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B) |
| **Method** | Scheduled QAT (linear bit-width reduction) |
| **Training data** | WikiText-103 (4000 sequences × 512 tokens) |
| **Hardware** | Kaggle TPU v5e-8 (8 cores) |
| **Epochs** | 1 |
| **Effective batch size** | 64 (4 per core × 2 gradient-accumulation steps × 8 cores) |
| **Learning rate** | 2e-5 (cosine decay) |
| **Optimizer** | AdamW (weight_decay=0.01) |
| **Training time** | ~1150 seconds |

### Bit-Width Schedule

```
Epoch:  0.0 ──── 0.1 ──────────── 0.9 ──── 1.0
Bits:   FP32      │    Linear      │   INT4
        (warmup)  │   32→16→8→4    │   (stabilize)
```

| Phase | Epoch Range | Bit-width |
|-------|-------------|-----------|
| Warmup | 0.0 → 0.1 | FP32 (no quantization noise) |
| Linear reduction | 0.1 → 0.9 | 32 → 16 → 8 → 4 (gradual) |
| Stabilization | 0.9 → 1.0 | INT4 (final fine-tuning) |

### QAT Training Results (WikiText-103 Test)

| Metric | Value |
|--------|-------|
| Test loss | 3.0392 |
| Test perplexity | 20.89 |

## How It Works

1. **Training (QAT):** The model is trained with fake-quantization nodes that simulate rounding noise in the forward pass at the currently scheduled bit-width: none during warmup, INT4 by the end. Training therefore pushes the weights toward values that lose little when rounded to the quantization grid.
2. **Export (this repo):** The QAT-hardened bf16 weights are converted to GGUF format and quantized to real 4-bit (Q4_K_M) and 8-bit (Q8_0) formats with `llama.cpp`'s `llama-quantize`.
3. **Deployment:** The GGUF files run directly on edge devices such as Android, iOS, and Raspberry Pi via `llama.cpp`.

Because QAT pre-adapted the weights to quantization, these GGUF files are expected to retain more quality than naively quantized (PTQ) versions of the same model.

## Usage

### With llama.cpp CLI

```bash
# Download the recommended Q4_K_M file
wget https://huggingface.co/jpcurada/SmolLM2-1.7B-Scheduled-QAT-Linear-GGUF/resolve/main/smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf

# Generate up to 100 tokens from a prompt
./llama-cli -m smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf \
  -p "The future of artificial intelligence is" -n 100
```

### With llama-cpp-python

```python
from llama_cpp import Llama

# Load the downloaded GGUF file from the current directory
llm = Llama(model_path="smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf")

# Generate up to 100 new tokens and print the completion text
output = llm("The future of AI is", max_tokens=100)
print(output["choices"][0]["text"])
```

## Related

- **bf16 weights (pre-quantization):** [jpcurada/SmolLM2-1.7B-Scheduled-QAT-Linear-INT4](https://huggingface.co/jpcurada/SmolLM2-1.7B-Scheduled-QAT-Linear-INT4)
- **Base model:** [HuggingFaceTB/SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B)

## Citation

This model is part of a thesis on Scheduled Quantization-Aware Training for Small Language Models targeting edge deployment.

## License

Apache 2.0 (same as base model)
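
## Appendix: Bit-Width Schedule Sketch

The schedule in Training Details (FP32 warmup until epoch 0.1, reduction to 4 bits by epoch 0.9, then INT4 stabilization) can be illustrated with a small sketch. This is not the training code used for this model. It assumes the bit-width is interpolated linearly between 32 and 4, and it uses a simple symmetric per-tensor rounding grid as the fake quantizer. The names `scheduled_bits` and `fake_quantize` are hypothetical, and the actual implementation may step through 32, 16, 8, and 4 differently or use another quantizer.

```python
import numpy as np

# Phase boundaries and bit-widths taken from the schedule table in this card.
WARMUP_END = 0.1
REDUCTION_END = 0.9
START_BITS = 32.0
FINAL_BITS = 4.0


def scheduled_bits(progress: float) -> float:
    """Simulated bit-width at a given fraction of the epoch (0.0 to 1.0).

    Assumption: linear interpolation between 32 and 4 bits during the
    reduction phase; the card does not spell out the exact rule.
    """
    if progress < WARMUP_END:
        return START_BITS
    if progress >= REDUCTION_END:
        return FINAL_BITS
    frac = (progress - WARMUP_END) / (REDUCTION_END - WARMUP_END)
    return START_BITS + frac * (FINAL_BITS - START_BITS)


def fake_quantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Round weights to a symmetric per-tensor grid and return floats.

    Assumption: one scale per tensor, integer levels in [-qmax, qmax]
    with qmax = 2**(bits - 1) - 1.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(weights)))
    if max_abs == 0.0:
        return weights.copy()
    scale = max_abs / qmax
    levels = np.clip(np.round(weights / scale), -qmax, qmax)
    return (levels * scale).astype(weights.dtype)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=1000)
    for progress in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
        bits = int(round(scheduled_bits(progress)))
        err = float(np.mean(np.abs(fake_quantize(w, bits) - w)))
        print(f"progress={progress:.1f} bits={bits:2d} mean_abs_error={err:.6f}")
```

Running the script prints the mean rounding error at several points of the epoch, which grows as the simulated bit-width falls. QAT implementations commonly pass gradients through the rounding step with a straight-through estimator. Whether the training code behind this model does so is not stated in this card.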