---
base_model: unsloth/Qwen3-Coder-30B-A3B-Instruct
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
language:
- en
tags:
- security
- cve
- patches
- backporting
- opensuse
- suse
- linux
- code-generation
- lora
- qlora
- moe
- mixture-of-experts
- qwen3
- unsloth
datasets:
- anicka/cve-backport-codegen-dataset
model-index:
- name: cve-backport-codegen-v5-qwen3-coder-30b-a3b
  results:
  - task:
      type: text-generation
      name: Security Patch Backporting
    dataset:
      type: anicka/cve-backport-codegen-dataset
      name: CVE Backport Codegen Dataset
    metrics:
    - name: Recall
      type: recall
      value: 0.919
    - name: Precision
      type: precision
      value: 0.916
    - name: Exact Match
      type: exact_match
      value: 0.87
---

# CVE Backport Codegen v5: Qwen3-Coder-30B-A3B QLoRA (MoE)

Fine-tuned code generation model for backporting upstream CVE security fixes to older SUSE/openSUSE package versions. Given vulnerable source code and an upstream fix description, the model outputs the corrected code. A separate tool then diffs the output against the original to produce a patch.

This is the **MoE sibling** of the dense v5 model, available at [openSUSE/CVE-Backport-Qwen2.5-Coder-32B](https://huggingface.co/openSUSE/CVE-Backport-Qwen2.5-Coder-32B) (and mirrored at [anicka/cve-backport-codegen-v5-qwen25-32b](https://huggingface.co/anicka/cve-backport-codegen-v5-qwen25-32b)). Same dataset, same task, same recall target, but built on Qwen3-Coder-30B-A3B's sparse Mixture-of-Experts architecture (30B total params, 3B active, 128 experts, top-8 routing). The result is **equivalent quality with roughly 10× faster inference** because generation touches only ~3B parameters per token.

## Why MoE?

v5 Qwen2.5-Coder-32B dense works well (93.1% recall on n=100) but is slow to serve in batch CVE backport workflows. Qwen3-Coder-30B-A3B offers the same code specialization with sparse activation, which is a big deal when you need to process hundreds of CVEs in a maintenance cycle.
The open question was whether MoE would train cleanly under QLoRA on a single GPU. It does, thanks to [unsloth](https://github.com/unslothai/unsloth)'s dedicated fused-3D expert parameter LoRA code path (PEFT's `target_parameters=` API applied to `mlp.experts.gate_up_proj` and `mlp.experts.down_proj`), which sidesteps the per-expert `nn.Linear` layout of older transformers while still reaching every expert in the network.

## Evaluation

Evaluated on 100 held-out examples from the cve-backport-codegen-dataset's official eval split, using the same diff-based recall/precision metric as v5 dense. Inference ran at temperature 0 with max_new_tokens 2048 via unsloth's FastLanguageModel. Each eval worker ran on a single H100 NVL; the eval set was split across two workers, one per H100, for wall-time efficiency.

### Overall (n=100)

| Metric | **Qwen3-Coder MoE (this model)** | v5 dense reference |
|--------|:---:|:---:|
| Avg recall | **91.9%** | 93.1% |
| Avg precision | **91.6%** | 94.4% |
| **Exact match** | **87/100** | 83/100 |
| Perfect (recall ≥ 95%) | 90/100 | 90/100 |
| Failures (recall < 10%) | 5/100 | 3/100 |

**Same apples-to-apples n=100 methodology as v5.** Recall is 1.2 pt below v5 dense, precision is 2.8 pt below, but **exact-match count is actually higher** (87 vs 83): the MoE model nails more patches character-for-character even though it has slightly more near-misses that cost a few recall points overall.

### By Tier (n=100)

| Tier | Count | **MoE recall** | v5 dense recall |
|------|:-----:|:--------------:|:---------------:|
| **Identical** (upstream applies as-is) | 85 | 92.1% | 93.7% |
| **Adapted** (requires modification) | 15 | **90.3%** | 90.0% |

**Adapted tier is a statistical tie** with v5 dense: the MoE model is marginally ahead on the harder tier where structural reasoning matters most. Identical tier is 1.6 pt behind.

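
The recall and precision figures above come from a diff-based metric whose exact implementation is not spelled out in this card. As a rough illustration only, the sketch below assumes the metric compares the set of changed lines (original vs. expected fix) with the set of changed lines (original vs. model output). The function names `changed_lines` and `diff_recall_precision` are hypothetical and are not taken from the evaluation code.

```python
import difflib


def changed_lines(original: str, fixed: str) -> set[tuple[str, str]]:
    """Collect (sign, text) pairs for every line added or removed."""
    changes = set()
    for line in difflib.ndiff(original.splitlines(), fixed.splitlines()):
        # ndiff prefixes added lines with "+ " and removed lines with "- ";
        # "  " (unchanged) and "? " (hint) lines are ignored.
        if line.startswith(("+ ", "- ")):
            changes.add((line[0], line[2:].strip()))
    return changes


def diff_recall_precision(original: str, expected: str, generated: str):
    """Assumed metric: overlap between expected and generated change sets."""
    want = changed_lines(original, expected)
    got = changed_lines(original, generated)
    hits = len(want & got)
    recall = hits / len(want) if want else 1.0
    if got:
        precision = hits / len(got)
    else:
        precision = 1.0 if not want else 0.0
    return recall, precision


if __name__ == "__main__":
    original = "a\nb\nc"
    expected = "a\nB\nc"
    generated = "a\nB\nc\nd"  # correct fix plus one unrelated extra line
    print(diff_recall_precision(original, expected, generated))
```

Under this assumed definition, an output that contains the right fix plus unrelated extra changes keeps full recall but loses precision, which is the behaviour the tables above distinguish.
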
### The Training Trajectory (n=20 instrumentation)

During training we instrumented a separate n=20 eval at two intermediate checkpoints to understand what fine-tuning actually does on a pretrained code MoE. The n=20 set is a subsample of the n=100 eval (same sampling step) so mid-training numbers are directly comparable to the n=20 slice of the final n=100 result.

| Stage | n | Recall | Precision | Exact | Failures |
|-------|:-:|:------:|:---------:|:-----:|:--------:|
| **Base model** (no fine-tuning) | 20 | 19.8% | 15.8% | 0/20 | 11/20 |
| **Step 2800** (31% training) | 20 | 59.4% | 62.1% | 7/20 | 6/20 |
| **Step 9042** (final, n=20 slice) | 20 | 90.0% | 90.0% | 18/20 | 2/20 |
| **Step 9042** (final, full n=100) | **100** | **91.9%** | **91.6%** | **87/100** | **5/100** |

**The base model starts at 19.8% recall.** Despite a low teacher-forced training loss even in early steps, autoregressive generation is poor because the base model doesn't know this task's output convention (bare code, no commentary, no markdown, no explanations). Fine-tuning's first job is to teach that convention, which is visible in the precision column: base 16% → mid 62% → final 90% on the n=20 slice (92% on the full n=100). The precision jump from 16 to 62 in the first 31% of training is almost entirely "stop rambling"; the rest of training is "find all the changes reliably."

**3 of the 11 baseline failures recovered to perfect scores by the final step.** These are examples where both the base model and the mid-training checkpoint emitted zero correct changes, but the fully-trained model produces the exact patch.

### Failure Analysis

The 5 remaining zero-recall cases at n=100 (2 more than v5 dense) are all on the identical tier and exhibit the same pattern: the model emits output that doesn't relate to the expected patch region at all. Likely causes: unusual patch structure, extremely long source context, or function signatures the base model tokenizes in a way that decouples generation from the input.
These are candidates for an agentic retry loop with error feedback.

## Model Details

| | |
|---|---|
| **Base model** | [unsloth/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct) |
| **Architecture** | Qwen3MoeForCausalLM (30B total / 3B active, 128 experts, top-8 routing) |
| **Method** | QLoRA via [unsloth](https://github.com/unslothai/unsloth) (4-bit NF4, double quantization, bf16 compute) |
| **LoRA rank / alpha** | 16 / 32 |
| **LoRA dropout** | 0 (required for LoRA on raw `nn.Parameter` tensors via PEFT's `target_parameters`) |
| **LoRA targets (user-facing)** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **LoRA targets (actual, after MoE expansion)** | attention Linears **+** `mlp.experts.gate_up_proj`, `mlp.experts.down_proj` (fused 3D tensors per layer) |
| **Trainable params** | 642,514,944 (2.06% of 31.2B total) |
| **Training data** | 36,166 train / 100 eval (from the 1,834-example eval split, same sampling as v5 dense) |
| **Epochs** | 2 (9,042 optimizer steps) |
| **Effective batch size** | 8 (1 × grad_accum 8) |
| **Learning rate** | 1e-4 (cosine schedule, 5% warmup) |
| **Max sequence length** | 4,096 tokens |
| **Optimizer** | adamw_8bit |
| **Gradient checkpointing** | `unsloth` mode |
| **Hardware** | 1× NVIDIA H100 NVL 94GB |
| **Training time** | **10h 25m** (vs v5 dense: 46h on 2× H100) |
| **Peak VRAM** | ~75GB during training |
| **Final train loss** | 0.02838 |
| **unsloth version** | 2026.4.4 |
| **transformers** | 5.5.0 |
| **PEFT** | 0.18.1 |

## Files

This repository contains:

- **LoRA adapter** (`adapter_model.safetensors`, `adapter_config.json`): ~2.5GB, apply via PEFT
- **Tokenizer** files (the model's chat template is required; Qwen3 family chat format)

## Reproduction via Teapot

This model was trained via the [teapot](https://github.com/anicka-net/teapot) training pipeline.
Reproduction is a four-step sequence once the cve-backport dataset is prepared:

```bash
git clone https://github.com/anicka-net/teapot
cd teapot
pip install -e .
pip install unsloth  # provides FastLanguageModel + fused-3D LoRA

# 1. Compose training data from the cve-backport module
teapot compose configs/cve-backport-qwen3-coder-qlora.config \
    --output train-cve-backport-qwen3-coder.jsonl

# 2. Generate the unsloth launch script
teapot train configs/cve-backport-qwen3-coder-qlora.config \
    --backend unsloth \
    --train-data train-cve-backport-qwen3-coder.jsonl \
    --output train-cve-backport-qwen3-coder.sh

# 3. Train (single GPU; see note below on why)
CUDA_VISIBLE_DEVICES=0 bash train-cve-backport-qwen3-coder.sh

# 4. Final adapter is at
#    output-teapot-cve-backport-qwen3-coder-qlora/final/
```

The teapot config (`configs/cve-backport-qwen3-coder-qlora.config`) pins all the hyperparameters: r=16, alpha=32, 2 epochs, lr=1e-4, max_length=4096, batch=1, grad_accum=8. See the config file for the full declaration.

### Note on single-GPU

`hardware.gpus: 1` in the config is deliberate. Multi-GPU model parallelism (`device_map="auto"`) across 2× H100 NVL triggers an assertion in `torch._higher_order_ops.flex_attention.create_fw_bw_graph` when tensors are split across devices. The single 94GB H100 fits the run comfortably (peak ~75GB during training), so this isn't a practical constraint.

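
As a quick sanity check on the hyperparameters pinned in the config, the step counts quoted in this card can be re-derived from the dataset size. This is only arithmetic on the numbers reported above; the one assumption, flagged in the comment, is that a trailing partial gradient-accumulation window still counts as an optimizer step.

```python
import math

train_examples = 36_166  # training set size reported in Model Details
epochs = 2
per_device_batch = 1
grad_accum = 8

effective_batch = per_device_batch * grad_accum

# Assumption: the last, partially filled accumulation window of each epoch
# still triggers an optimizer step (hence ceil rather than floor).
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Fraction of training completed at the intermediate n=20 checkpoint.
mid_checkpoint = 2_800
mid_fraction = mid_checkpoint / total_steps

print(effective_batch, steps_per_epoch, total_steps, f"{mid_fraction:.0%}")
# prints: 8 4521 9042 31%
```
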
## Usage

### With transformers + PEFT + unsloth (recommended)

```python
from unsloth import FastLanguageModel
from peft import PeftModel

# Load the 4-bit base model and tokenizer on GPU 0.
base, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-Coder-30B-A3B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,
    device_map={"": 0},
    attn_implementation="sdpa",  # avoid flex_attention inference bug
)
# Attach the LoRA adapter from this repository.
model = PeftModel.from_pretrained(
    base, "anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b"
)
FastLanguageModel.for_inference(model)
```

### With transformers + PEFT (stock, slower)

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Same quantization settings as training: 4-bit NF4, double quant, bf16 compute.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Coder-30B-A3B-Instruct",
    quantization_config=bnb,
    device_map={"": 0},
    attn_implementation="sdpa",
)
model = PeftModel.from_pretrained(
    base, "anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-30B-A3B-Instruct")
```

### With the CVE Backport Tool

The recommended way to use this model is via the [cve-backport-tool](https://github.com/openSUSE/cve-backport-tool), which handles patch parsing, source extraction, model inference, and diff generation.

### Prompt Template

The chat template is the standard Qwen3 chat format (ChatML-like). `apply_chat_template` with the tokenizer handles this automatically. The system prompt used during training:

```
You are a security patch backporting assistant. Given vulnerable source code and a description of the upstream fix, output the FIXED version of the code.

Rules:
- Output ONLY the fixed code, nothing else
- Preserve all surrounding context exactly
- Apply only the described fix
```

## Limitations

- **5 catastrophic failures** (5/100): examples where recall drops below 10%, all on the identical tier. These represent hard edge cases (unusual patch structure, very long source context) and likely need an agentic retry loop with error feedback. v5 dense has 3 such failures on the same eval, so this model is slightly more prone to the catastrophic-output failure mode.
- **Precision is ~3 pt below v5 dense**: the MoE occasionally produces "partial rambles" that get the right fix but also emit extra unrelated changes. The diff-based metric scores these with high recall but low precision. In practice the tool can filter these with a precision threshold.
- **No compilation feedback**: single-pass generation without verifying the output compiles. Use `--retry` in the CVE backport CLI tool for iterative correction.
- **Context window**: 4,096 token training limit. Very large functions or cross-file adaptations may be truncated.
- **MoE inference requires unsloth or stock transformers 5.x**, because the LoRA is attached to fused 3D parameter tensors in the MoE expert blocks. Older transformers versions (<5.0) expect per-expert `nn.Linear` modules and will not load this adapter correctly.
- Always review generated patches before applying to production systems.

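
The retry loop with error feedback mentioned above is not part of this repository. As a minimal sketch of the idea, the loop below takes two caller-supplied callables, both hypothetical: `generate`, which wraps model inference and accepts optional feedback from the previous attempt, and `check`, which returns `None` on success or an error message (for example, compiler output). It does not describe how the CVE backport tool's `--retry` option is implemented.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, for illustration only.
Generate = Callable[[str, str, Optional[str]], str]
Check = Callable[[str], Optional[str]]


def backport_with_retry(
    generate: Generate,
    check: Check,
    source: str,
    fix_description: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Regenerate until `check` passes or attempts run out.

    Returns (accepted_code_or_None, attempts_used).
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(source, fix_description, feedback)
        error = check(candidate)
        if error is None:
            return candidate, attempt
        # Feed the error back so the next attempt can correct it.
        feedback = error
    return None, max_attempts


if __name__ == "__main__":
    # Stub callables standing in for real inference and a real build check.
    def fake_generate(src: str, desc: str, fb: Optional[str]) -> str:
        return src + "\n/* fixed */" if fb else src

    def fake_check(code: str) -> Optional[str]:
        return None if "fixed" in code else "error: fix not applied"

    print(backport_with_retry(fake_generate, fake_check, "int x;", "add check"))
```

Passing the callables in keeps the loop independent of how inference and verification are actually done, so the same skeleton works with a compiler, a test suite, or a diff-applies check.
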
## Related

- **Dense sibling (openSUSE)**: [openSUSE/CVE-Backport-Qwen2.5-Coder-32B](https://huggingface.co/openSUSE/CVE-Backport-Qwen2.5-Coder-32B), the v5 Qwen2.5-Coder-32B dense model, 93.1% recall on n=100 (1.2 pt higher recall, but this MoE model has 4 more exact matches)
- **Dense sibling (anicka mirror)**: [anicka/cve-backport-codegen-v5-qwen25-32b](https://huggingface.co/anicka/cve-backport-codegen-v5-qwen25-32b)
- **CLI tool**: [openSUSE/cve-backport-tool](https://github.com/openSUSE/cve-backport-tool)
- **Dataset**: [anicka/cve-backport-codegen-dataset](https://huggingface.co/datasets/anicka/cve-backport-codegen-dataset)
- **Training pipeline**: [teapot](https://github.com/anicka-net/teapot)

## Citation

```bibtex
@misc{cve-backport-codegen-v5-qwen3-coder-30b-a3b,
  title={CVE Backport Codegen v5 (MoE): Fine-tuned Qwen3-Coder-30B-A3B for Security Patch Backporting},
  author={Anna Maresova},
  year={2026},
  url={https://huggingface.co/anicka/cve-backport-codegen-v5-qwen3-coder-30b-a3b}
}
```