---
base_model: unsloth/Qwen3-Coder-30B-A3B-Instruct
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
language:
  - en
tags:
  - security
  - cve
  - patches
  - backporting
  - opensuse
  - suse
  - linux
  - code-generation
  - lora
  - qlora
  - moe
  - mixture-of-experts
  - qwen3
  - unsloth
datasets:
  - anicka/cve-backport-codegen-dataset
model-index:
  - name: cve-backport-codegen-v5-qwen3-coder-30b-a3b
    results:
      - task:
          type: text-generation
          name: Security Patch Backporting
        dataset:
          type: anicka/cve-backport-codegen-dataset
          name: CVE Backport Codegen Dataset
        metrics:
          - name: Recall
            type: recall
            value: 0.919
          - name: Precision
            type: precision
            value: 0.916
          - name: Exact Match
            type: exact_match
            value: 0.87
---

# CVE Backport Codegen v5 — Qwen3-Coder-30B-A3B QLoRA (MoE)

Fine-tuned code generation model for backporting upstream CVE security fixes
to older SUSE/openSUSE package versions. Given vulnerable source code and an
upstream fix description, the model outputs the corrected code. A separate
tool then diffs the output against the original to produce a patch.

This is the **MoE sibling** of the dense v5 model, available at
[openSUSE/CVE-Backport-Qwen2.5-Coder-32B](https://huggingface.co/openSUSE/CVE-Backport-Qwen2.5-Coder-32B)
(and mirrored at [anicka/cve-backport-codegen-v5-qwen25-32b](https://huggingface.co/anicka/cve-backport-codegen-v5-qwen25-32b)).
Same dataset, same task, same recall target — but built on Qwen3-Coder-30B-A3B's
sparse Mixture-of-Experts architecture (30B total params, 3B active, 128 experts,
top-8 routing). The result is **equivalent quality with roughly 10× faster
inference** because generation touches only ~3B parameters per token.

## Why MoE?

v5 Qwen2.5-Coder-32B dense works well (93.1% recall on n=100) but is slow
to serve in batch CVE backport workflows. Qwen3-Coder-30B-A3B offers the
same code specialization with sparse activation, which is a big deal when
you need to process hundreds of CVEs in a maintenance cycle.

The open question was whether MoE would train cleanly under QLoRA on a
single GPU. It does, thanks to [unsloth](https://github.com/unslothai/unsloth)'s
dedicated fused-3D expert parameter LoRA code path (PEFT's `target_parameters=`
API applied to `mlp.experts.gate_up_proj` and `mlp.experts.down_proj`), which
sidesteps the per-expert `nn.Linear` layout of older transformers while still
reaching every expert in the network.

## Evaluation

Evaluated on 100 held-out examples from the cve-backport-codegen-dataset's
official eval split, using the same diff-based recall/precision metric as
v5 dense. Inference at temperature 0, max_new_tokens 2048, via unsloth's
FastLanguageModel on a single H100 NVL (split across both H100s via two
eval workers for wall-time efficiency).

### Overall (n=100)

| Metric | **Qwen3-Coder MoE (this model)** | v5 dense reference |
|--------|:---:|:---:|
| Avg recall | **91.9%** | 93.1% |
| Avg precision | **91.6%** | 94.4% |
| **Exact match** | **87/100** | 83/100 |
| Perfect (recall ≥ 95%) | 90/100 | 90/100 |
| Failures (recall < 10%) | 5/100 | 3/100 |

**Same apples-to-apples n=100 methodology as v5.** Recall is 1.2 pt below
v5 dense, precision is 2.8 pt below, but **exact-match count is actually
higher** (87 vs 83) — the MoE model nails more patches character-for-character
even though it has slightly more near-misses that cost a few recall points
overall.

### By Tier (n=100)

| Tier | Count | **MoE recall** | v5 dense recall |
|------|:-----:|:--------------:|:---------------:|
| **Identical** (upstream applies as-is) | 85 | 92.1% | 93.7% |
| **Adapted** (requires modification) | 15 | **90.3%** | 90.0% |

**Adapted tier is a statistical tie** with v5 dense — the MoE model is
marginally ahead on the harder tier where structural reasoning matters
most. Identical tier is 1.6 pt behind.

### The Training Trajectory (n=20 instrumentation)

During training we instrumented a separate n=20 eval at two intermediate
checkpoints to understand what fine-tuning actually does on a pretrained
code MoE. The n=20 set is a subsample of the n=100 eval (same sampling
step) so mid-training numbers are directly comparable to the n=20 slice
of the final n=100 result.

| Stage | n  | Recall | Precision | Exact | Failures |
|-------|:-:|:------:|:---------:|:-----:|:--------:|
| **Base model** (no fine-tuning) | 20 | 19.8% | 15.8% | 0/20 | 11/20 |
| **Step 2800** (31% training) | 20 | 59.4% | 62.1% | 7/20 | 6/20 |
| **Step 9042** (final, n=20 slice) | 20 | 90.0% | 90.0% | 18/20 | 2/20 |
| **Step 9042** (final, full n=100) | **100** | **91.9%** | **91.6%** | **87/100** | **5/100** |

**The base model starts at 19.8% recall.** Despite a low teacher-forced
training loss even in early steps, autoregressive generation is poor
because the base model doesn't know this task's output convention (bare
code, no commentary, no markdown, no explanations). Fine-tuning's first
job is to teach that convention, which is visible in the precision
column: base precision 16% → mid 62% → final 92%. The precision jump
from 16 to 62 in the first 31% of training is almost entirely "stop
rambling"; the second half is "find all the changes reliably."

**3 of the 11 baseline failures recovered to perfect scores by the final
step** — examples where both the base model and the mid-training
checkpoint emitted zero correct changes, but the fully-trained model
produces the exact patch.

### Failure Analysis

The 5 remaining zero-recall cases at n=100 (2 more than v5 dense) are
all on the identical tier and exhibit the same pattern: the model emits
output that doesn't relate to the expected patch region at all. Likely
causes: unusual patch structure, extremely long source context, or
function signatures the base model tokenizes in a way that decouples
generation from the input. These are candidates for an agentic retry
loop with error feedback.

## Model Details

| | |
|---|---|
| **Base model** | [unsloth/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct) |
| **Architecture** | Qwen3MoeForCausalLM (30B total / 3B active, 128 experts, top-8 routing) |
| **Method** | QLoRA via [unsloth](https://github.com/unslothai/unsloth) (4-bit NF4, double quantization, bf16 compute) |
| **LoRA rank / alpha** | 16 / 32 |
| **LoRA dropout** | 0 (required for LoRA on raw `nn.Parameter` tensors via PEFT's `target_parameters`) |
| **LoRA targets (user-facing)** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **LoRA targets (actual, after MoE expansion)** | attention Linears **+** `mlp.experts.gate_up_proj`, `mlp.experts.down_proj` (fused 3D tensors per layer) |
| **Trainable params** | 642,514,944 (2.06% of 31.2B total) |
| **Training data** | 36,166 train / 100 eval (from the 1,834-example eval split, same sampling as v5 dense) |
| **Epochs** | 2 (9,042 optimizer steps) |
| **Effective batch size** | 8 (1 × grad_accum 8) |
| **Learning rate** | 1e-4 (cosine schedule, 5% warmup) |
| **Max sequence length** | 4,096 tokens |
| **Optimizer** | adamw_8bit |
| **Gradient checkpointing** | `unsloth` mode |
| **Hardware** | 1× NVIDIA H100 NVL 94GB |
| **Training time** | **10h 25m** (vs v5 dense: 46h on 2× H100) |
| **Peak VRAM** | ~75GB during training |
| **Final train loss** | 0.02838 |
| **unsloth version** | 2026.4.4 |
| **transformers** | 5.5.0 |
| **PEFT** | 0.18.1 |

## Files

This repository contains:

- **LoRA adapter** (`adapter_model.safetensors`, `adapter_config.json`) — ~2.5GB, apply via PEFT
- **Tokenizer** files (the model's chat template is required — Qwen3 family chat format)

## Reproduction via Teapot

This model was trained via the [teapot](https://github.com/anicka-net/teapot)
training pipeline. Reproduction is a four-command sequence once the
cve-backport dataset is prepared:

```bash
git clone https://github.com/anicka-net/teapot
cd teapot
pip install -e .
pip install unsloth  # provides FastLanguageModel + fused-3D LoRA

# 1. Compose training data from the cve-backport module
teapot compose configs/cve-backport-qwen3-coder-qlora.config \
    --output train-cve-backport-qwen3-coder.jsonl

# 2. Generate the unsloth launch script
teapot train configs/cve-backport-qwen3-coder-qlora.config \
    --backend unsloth \
    --train-data train-cve-backport-qwen3-coder.jsonl \
    --output train-cve-backport-qwen3-coder.sh

# 3. Train (single GPU; see note below on why)
CUDA_VISIBLE_DEVICES=0 bash train-cve-backport-qwen3-coder.sh

# 4. Final adapter is at
#    output-teapot-cve-backport-qwen3-coder-qlora/final/
```

The teapot config (`configs/cve-backport-qwen3-coder-qlora.config`) pins
all the hyperparameters: r=16, alpha=32, 2 epochs, lr=1e-4, max_length=4096,
batch=1, grad_accum=8. See the config file for the full declaration.

### Note on single-GPU

`hardware.gpus: 1` in the config is deliberate. Multi-GPU model parallelism
(device_map="auto") across 2× H100 NVL triggers an assertion in
`torch._higher_order_ops.flex_attention.create_fw_bw_graph` when tensors
are split across devices. The single 94GB H100 fits comfortably (peak ~75GB
during training) so this isn't a practical constraint.

## Usage

### With transformers + PEFT + unsloth (recommended)

```python
from unsloth import FastLanguageModel
from peft import PeftModel

base, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-Coder-30B-A3B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,
    device_map={"": 0},
    attn_implementation="sdpa",  # avoid flex_attention inference bug
)
model = PeftModel.from_pretrained(
    base, "anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b"
)
FastLanguageModel.for_inference(model)
```

### With transformers + PEFT (stock, slower)

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    quantization_config=bnb,
    device_map={"": 0},
    attn_implementation="sdpa",
)
model = PeftModel.from_pretrained(
    base, "anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-30B-A3B-Instruct")
```

### With the CVE Backport Tool

The recommended way to use this model is via the
[cve-backport-tool](https://github.com/openSUSE/cve-backport-tool),
which handles patch parsing, source extraction, model inference, and
diff generation.

### Prompt Template

The chat template is the standard Qwen3 chat format (ChatML-like).
`apply_chat_template` with the tokenizer handles this automatically.
The system prompt used during training:

```
You are a security patch backporting assistant.

Given vulnerable source code and a description of the upstream fix,
output the FIXED version of the code.

Rules:
- Output ONLY the fixed code, nothing else
- Preserve all surrounding context exactly
- Apply only the described fix
```

## Limitations

- **5 failure modes** (5/100) — examples where recall drops to <10%, all
  on the identical tier. These represent hard edge cases (unusual patch
  structure, very long source context) and likely need an agentic retry
  loop with error feedback. v5 dense has 3 such failures on the same
  eval, so this model is slightly more prone to the catastrophic-output
  failure mode.
- **Precision is ~3 pt below v5 dense**: the MoE occasionally produces
  "partial rambles" that get the right fix but also emit extra unrelated
  changes. The diff-based metric penalizes these with high recall but
  low precision. In practice the tool can filter these with a precision
  threshold.
- **No compilation feedback**: single-pass generation without verifying
  the output compiles. Use `--retry` in the CVE backport CLI tool for
  iterative correction.
- **Context window**: 4,096 token training limit. Very large functions
  or cross-file adaptations may be truncated.
- **MoE inference requires unsloth or stock transformers 5.x**, because
  the LoRA is attached to fused 3D parameter tensors in the MoE expert
  blocks. Older transformers versions (<5.0) expect per-expert `nn.Linear`
  modules and will not load this adapter correctly.
- Always review generated patches before applying to production systems.

## Related

- **Dense sibling (openSUSE)**: [openSUSE/CVE-Backport-Qwen2.5-Coder-32B](https://huggingface.co/openSUSE/CVE-Backport-Qwen2.5-Coder-32B) — v5 Qwen2.5-Coder-32B dense, 93.1% recall on n=100 (1.2 pt higher recall, but this MoE model has 4 more exact matches)
- **Dense sibling (anicka mirror)**: [anicka/cve-backport-codegen-v5-qwen25-32b](https://huggingface.co/anicka/cve-backport-codegen-v5-qwen25-32b)
- **CLI tool**: [openSUSE/cve-backport-tool](https://github.com/openSUSE/cve-backport-tool)
- **Dataset**: [anicka/cve-backport-codegen-dataset](https://huggingface.co/datasets/anicka/cve-backport-codegen-dataset)
- **Training pipeline**: [teapot](https://github.com/anicka-net/teapot)

## Citation

```bibtex
@misc{cve-backport-codegen-v5-qwen3-coder-30b-a3b,
  title={CVE Backport Codegen v5 (MoE): Fine-tuned Qwen3-Coder-30B-A3B for Security Patch Backporting},
  author={Anna Maresova},
  year={2026},
  url={https://huggingface.co/anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b}
}
```