---
language:
  - en
license: apache-2.0
library_name: gguf
base_model: HuggingFaceTB/SmolLM2-1.7B
pipeline_tag: text-generation
tags:
  - gguf
  - llama-cpp
  - quantization
  - qat
  - quantization-aware-training
  - scheduled-qat
  - smollm2
  - edge-deployment
  - int4
  - int8
model_name: SmolLM2-1.7B Scheduled QAT Linear GGUF
datasets:
  - wikitext
quantized_by: jpcurada
---

# SmolLM2-1.7B — Scheduled QAT (Linear Schedule) — GGUF

GGUF quantized versions of [SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B), trained with **Scheduled Quantization-Aware Training** (linear bit-width reduction schedule) before quantization.

> **Key insight:** Unlike weights produced by plain Post-Training Quantization (PTQ), these weights were trained to tolerate quantization. During training, precision was gradually reduced from FP32 → FP16 → INT8 → INT4 following a linear schedule, so the model could adapt its weights to the quantization noise at each stage.

## Files

| Filename | Quant | Size | BPW | Description |
|----------|-------|------|-----|-------------|
| `smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf` | Q4_K_M | ~1.0 GB | 4.93 | **Recommended** — best quality/size ratio for edge deployment |
| `smollm2-1.7b-sched-qat-linear-Q8_0.gguf` | Q8_0 | ~1.7 GB | 8.50 | Higher quality, larger size |
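
The sizes in the table can be roughly cross-checked from the bits-per-weight (BPW) column. The sketch below assumes a parameter count of about 1.71 billion, which is not stated in this card, and ignores GGUF metadata overhead; under those assumptions the estimates line up with the listed sizes when read as GiB.

```python
# Assumption: ~1.71e9 parameters (not stated in this card).
N_PARAMS = 1.71e9


def estimated_size_gib(bits_per_weight, n_params=N_PARAMS):
    """Approximate file size in GiB, ignoring GGUF metadata overhead."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# BPW values are taken from the Files table above.
for name, bpw in [("Q4_K_M", 4.93), ("Q8_0", 8.50)]:
    print(f"{name}: ~{estimated_size_gib(bpw):.2f} GiB")
```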

## Training Details

| Parameter | Value |
|-----------|-------|
| **Base model** | [HuggingFaceTB/SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B) |
| **Method** | Scheduled QAT (Linear bit-width reduction) |
| **Training data** | WikiText-103 (4000 sequences × 512 tokens) |
| **Hardware** | Kaggle TPU v5e-8 (8 cores) |
| **Epochs** | 1 |
| **Effective batch size** | 64 (4 per-core × 2 grad accum × 8 cores) |
| **Learning rate** | 2e-5 (cosine decay) |
| **Optimizer** | AdamW (weight_decay=0.01) |
| **Training time** | ~1150 seconds |
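
The batch-size arithmetic in the table can be checked with a few lines. The variable names are illustrative, and the step count assumes the final partial batch is kept rather than dropped, which this card does not specify.

```python
import math

# Values from the Training Details table.
per_core_batch = 4
grad_accum_steps = 2
num_cores = 8
num_sequences = 4000
seq_len = 512

# Sequences consumed per optimizer step across all cores.
effective_batch = per_core_batch * grad_accum_steps * num_cores

# Assumption: the last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_sequences / effective_batch)

# Total tokens seen in the single training epoch.
tokens_per_epoch = num_sequences * seq_len

print(effective_batch, steps_per_epoch, tokens_per_epoch)
```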

### Bit-Width Schedule

```
Epoch:  0.0 ──── 0.1 ──────────── 0.9 ──── 1.0
Bits:   FP32      │    Linear     │   INT4
        (warmup)  │   32→16→8→4   │  (stabilize)
```

| Phase | Epoch Range | Bit-width |
|-------|------------|-----------|
| Warmup | 0.0 → 0.1 | FP32 (no quantization noise) |
| Linear reduction | 0.1 → 0.9 | 32 → 16 → 8 → 4 (gradual) |
| Stabilization | 0.9 → 1.0 | INT4 (final fine-tuning) |
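
One way to read the three phases is as a piecewise function of training progress. The sketch below is an assumption, not the training code: it interpolates the bit-width continuously between 32 and 4, whereas the actual run may have stepped through 16 and 8 as discrete stages. The function name `scheduled_bits` and its parameters are hypothetical.

```python
def scheduled_bits(progress, warmup_end=0.1, reduce_end=0.9,
                   start_bits=32.0, final_bits=4.0):
    """Return the target bit-width for a training progress value in [0, 1].

    Assumption: continuous linear interpolation between start_bits and
    final_bits during the reduction phase.
    """
    if progress <= warmup_end:
        return start_bits   # warmup: full precision, no quantization noise
    if progress >= reduce_end:
        return final_bits   # stabilization: hold the final bit-width
    frac = (progress - warmup_end) / (reduce_end - warmup_end)
    return start_bits + frac * (final_bits - start_bits)


for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"progress={p:.1f} -> {scheduled_bits(p):.1f} bits")
```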

### QAT Training Results (WikiText-103 Test)

| Metric | Value |
|--------|-------|
| Test loss | 3.0392 |
| Test perplexity | 20.89 |
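
The two metrics are related by perplexity = exp(loss), assuming the reported loss is the mean per-token cross-entropy in nats. A quick check:

```python
import math

test_loss = 3.0392                  # from the table above
perplexity = math.exp(test_loss)    # assumes loss is mean cross-entropy in nats

print(f"{perplexity:.2f}")          # prints 20.89
```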

## How It Works

1. **Training (QAT):** Model weights are trained with fake-quantization nodes that simulate rounding noise in the forward pass at the currently scheduled bit-width, ending at INT4. Gradient updates then move the weights toward values that lose little when rounded to the quantization grid.
2. **Export (this repo):** The QAT-hardened bf16 weights are converted to GGUF format and quantized to Q4_K_M (4-bit) and Q8_0 (8-bit) using `llama.cpp`'s `llama-quantize`.
3. **Deployment:** The GGUF files run directly on edge devices via `llama.cpp` — Android, iOS, Raspberry Pi.

Because QAT pre-adapted the weights to quantization noise, these GGUF files are expected to lose less quality than GGUF files produced by quantizing the base model directly (PTQ).
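
To make the fake-quantization step in the training stage concrete, here is a minimal pure-Python sketch of a quantize-then-dequantize round trip. It assumes symmetric per-tensor quantization with a single scale, which may differ from the scheme used in training, and it shows only the forward pass. In an autodiff framework the rounding is typically bypassed in the backward pass with a straight-through estimator. The function name `fake_quantize` is hypothetical.

```python
def fake_quantize(weights, bits):
    """Round weights to a signed integer grid, then map them back to floats.

    Assumption: symmetric per-tensor quantization with one shared scale.
    """
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)                    # nothing to quantize
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = round(w / scale)                    # nearest grid index
        q = max(-qmax, min(qmax, q))            # clamp to the representable range
        out.append(q * scale)                   # dequantize back to float
    return out


weights = [0.31, -0.12, 0.05, -0.44, 0.27]
w4 = fake_quantize(weights, bits=4)
w8 = fake_quantize(weights, bits=8)

# Worst-case rounding error at each bit-width.
err4 = max(abs(a - b) for a, b in zip(weights, w4))
err8 = max(abs(a - b) for a, b in zip(weights, w8))
print(f"max error at 4 bits: {err4:.4f}")
print(f"max error at 8 bits: {err8:.4f}")
```

The 4-bit grid is much coarser than the 8-bit one, so its rounding error is larger; this is the noise the model sees during the later phases of the schedule.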

## Usage

### With llama.cpp CLI

```bash
# Download
wget https://huggingface.co/jpcurada/SmolLM2-1.7B-Scheduled-QAT-Linear-GGUF/resolve/main/smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf

# Run
./llama-cli -m smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf \
    -p "The future of artificial intelligence is" -n 100
```

### With llama-cpp-python

```python
from llama_cpp import Llama

llm = Llama(model_path="smollm2-1.7b-sched-qat-linear-Q4_K_M.gguf")
output = llm("The future of AI is", max_tokens=100)
print(output["choices"][0]["text"])
```

## Related

- **bf16 weights (pre-quantization):** [jpcurada/SmolLM2-1.7B-Scheduled-QAT-Linear-INT4](https://huggingface.co/jpcurada/SmolLM2-1.7B-Scheduled-QAT-Linear-INT4)
- **Base model:** [HuggingFaceTB/SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B)

## Citation

This model is part of a thesis on Scheduled Quantization-Aware Training for Small Language Models targeting edge deployment.

## License

Apache 2.0 (same as base model)