---
license: apache-2.0
base_model: Qwen/Qwen3-4B-Instruct-2507
library_name: peft
tags:
- peft
- lora
- sft
- trl
- qwen3
- hardware
- eu-ai-act
- art-52
- art-53
- gpai-fine-tune
language:
- fr
- en
pipeline_tag: text-generation
---

# Model Card for qwen3-4b-mascarade-emc-lora

This model is a fine-tuned version of [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507). It has been trained using [TRL](https://github.com/huggingface/trl) with SFT on an EMC compliance corpus as part of the Ailiance mascarade LoRA family.

## Quick start

```python
from transformers import pipeline

question = "What decoupling capacitor strategy minimizes conducted emissions on a switching regulator?"

# Build a text-generation pipeline from the adapter repository.
generator = pipeline(
    "text-generation",
    model="Ailiance-fr/qwen3-4b-mascarade-emc-lora",
    device="cuda",
)

# Chat-style input: a list of role/content messages.
output = generator(
    [{"role": "user", "content": question}],
    max_new_tokens=256,
    return_full_text=False,
)[0]
print(output["generated_text"])
```

## Bench results: ailiance-bench Phase 7 (CUDA, 2026-05-11)

Functional eval via the parsers/scorers from [`ailiance/ailiance-bench`](https://github.com/ailiance/ailiance-bench) Phase 1 (`bench_kicad_functional`), ported to CUDA / transformers + PEFT for the Qwen3-4B-Instruct-2507 base.

| Dataset | n | Composite score | Duration |
|---|---:|---:|---:|
| `emc-dsp-power` | 10 | **0.646** | 1221.9s |

_Composite score combines structural-parse-ok, component-count match, ground-node presence, etc.; see `bench_kicad_functional.score_*` for the exact formula. Greedy decoding, max_tokens per `GEN_PARAMS`._

## Upstream base model: official evaluations

These are the **official scores** for the unmodified base model [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507), as reported by the Alibaba Qwen team. They represent the floor of capability that this LoRA inherits before the hardware-domain fine-tune adapts behavior.

| Category      | Benchmark               | Qwen3-4B-Instruct-2507 |
|---------------|-------------------------|-----------------------:|
| Knowledge     | MMLU-Pro                | 69.6 |
| Knowledge     | MMLU-Redux              | 84.2 |
| Knowledge     | GPQA                    | 62.0 |
| Knowledge     | SuperGPQA               | 42.8 |
| Reasoning     | AIME25                  | 47.4 |
| Reasoning     | HMMT25                  | 31.0 |
| Reasoning     | ZebraLogic              | 80.2 |
| Reasoning     | LiveBench 2024-11-25    | 63.0 |
| Coding        | LiveCodeBench v6        | 35.1 |
| Coding        | MultiPL-E               | 76.8 |
| Coding        | Aider-Polyglot          | 12.9 |
| Alignment     | IFEval                  | 83.4 |
| Alignment     | Arena-Hard v2           | 43.4 |
| Alignment     | Creative Writing v3     | 83.5 |
| Alignment     | WritingBench            | 83.4 |
| Agent         | BFCL-v3                 | 61.9 |
| Agent         | TAU1-Retail             | 48.7 |
| Agent         | TAU1-Airline            | 32.0 |
| Agent         | TAU2-Retail             | 40.4 |
| Multilingual  | MultiIF                 | 69.0 |
| Multilingual  | MMLU-ProX               | 61.6 |
| Multilingual  | INCLUDE                 | 60.1 |
| Multilingual  | PolyMATH                | 31.1 |

Source: [official Qwen3-4B-Instruct-2507 model card](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).

> **Reading these numbers alongside the Phase 6 bench below:** the upstream
> scores measure *general* capability (knowledge, reasoning, coding, alignment).
> The Phase 6 deltas measure *hardware-domain* specialization (KiCad, SPICE,
> schematic extraction). A rank-16 LoRA adapter adds trainable parameters
> amounting to less than 1% of the base weights, so the upstream scores remain
> approximately the floor: this LoRA adds the Phase 6 deltas on top of these
> inherited capabilities.

## Training procedure

This model was trained with SFT on an EMC compliance corpus.

### Framework versions

- TRL: 1.4.0
- Transformers: 5.8.0
- PyTorch: 2.11.0
- Datasets: 4.8.5
- Tokenizers: 0.22.2

## Bench results: held-out token-overlap (eval_mascarade_lora, n=10)

Evaluated on 10 random held-out prompts from `Ailiance-fr/mascarade-emc-dataset` (seed=101 ≠ train seed 42).

| Metric | Value |
|---|---:|
| Avg Jaccard token-overlap | **0.06** |
| Avg generation tokens | 143.5 |
| Avg latency (per sample, RTX 4090) | 7.8s |

_Token-overlap is a coarse quality proxy: high overlap (>0.4) suggests the LoRA reproduces domain vocabulary; low overlap indicates either domain shift or stylistic divergence from the reference. See `ailiance/ailiance-bench` for richer functional evaluations (KiCad DRC, SPICE convergence, etc.) on the same family._

## Citations

```bibtex
@software{vonwerra2020trl,
  title   = {{TRL: Transformers Reinforcement Learning}},
  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url     = {https://github.com/huggingface/trl},
  year    = {2020}
}
```

## Bench (vs base Qwen3-4B)

Consolidated comparison of this LoRA against its base, drawing on two complementary evaluation streams. The reference base used for cross-adapter comparison in Phase 6 is `gemma-e4b-eu-kiki-base` (legacy Gemma-4 ancestor). A dedicated Qwen3-4B-Instruct-2507 baseline run is not in our pipeline yet, so those rows are `n/a`.

### Phase 6: cross-adapter scoreboard (reference base: `gemma-e4b-eu-kiki-base`)

| Phase | iact-bench task | Base | Tuned (+mascarade) | Δ |
|---|---|--:|--:|--:|
| P3 | kicad-sch-extract (cross-domain) | 0.308 | 0.785 | +0.477 |

### Phase 7: CUDA functional eval on Qwen3-4B base (production-aligned)

| Dataset | n | Base (Qwen3-4B) | Tuned (this LoRA) | Δ |
|---|--:|--:|--:|--:|
| `emc-dsp-power` | 10 | n/a | **0.646** | n/a |

**Methodology:** iact-bench v0.2.0 (audit-grade Docker validators), greedy decoding, max_tokens per `GEN_PARAMS`. NDJSON audit trail in [`ailiance/ailiance-bench`](https://github.com/ailiance/ailiance-bench).

**Scoring date:** 2026-05-11 (commit [`46801af`](https://github.com/ailiance/ailiance-bench/commit/46801af)).

> Phase 6 numbers reflect adapter behavior on a Gemma-4 reference base; domain semantics transfer to the Qwen3-4B production base served via Tower Ollama `:8004`, but absolute scores may shift. A Qwen3-4B baseline run is tracked for a future bench refresh.

## Cross-domain forgetting check (Phase 9, 2026-05-11)

For each domain's eval set (seed=101, n samples held-out), compare this LoRA's Jaccard token-overlap vs the Qwen3-4B-Instruct-2507 **baseline (no adapter)** on the SAME prompts. Negative Δ = the LoRA degrades base behaviour on that domain.

| Eval domain | LoRA Jaccard | Δ vs base |
|---|---:|---:|
| `kicad` | 0.09 | +0.003 |
| `spice` | 0.007 | +0.002 |
| `stm32` | 0.055 | +0.005 |
| `emc` | 0.061 | -0.005 ⬅ in-domain |
| `embedded` | 0.072 | -0.002 |
| `platformio` | 0.047 | +0.005 |
| `freecad` | 0.03 | +0.009 |
| `dsp` | 0.098 | -0.003 |
| `iot` | 0.05 | -0.018 |
| `power` | 0.078 | +0.010 |

**In-domain Δ**: -0.005

**Out-of-domain mean Δ**: 0.001
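
For readers who want to reproduce the summary figures, here is a minimal sketch of a Jaccard token-overlap score and of the in-domain and out-of-domain Δ aggregation. The function names and the tokenization (lowercased whitespace split) are assumptions made for illustration; the actual `eval_mascarade_lora` script may tokenize and aggregate differently. The Δ values are copied from the table above.

```python
def jaccard_token_overlap(generated: str, reference: str) -> float:
    """Jaccard similarity between the token sets of two strings.

    Assumption: tokens are lowercased, whitespace-separated words.
    The real evaluation script may use a different tokenizer.
    """
    gen_tokens = set(generated.lower().split())
    ref_tokens = set(reference.lower().split())
    union = gen_tokens | ref_tokens
    if not union:
        # Two empty strings share no tokens; define the overlap as 0.
        return 0.0
    return len(gen_tokens & ref_tokens) / len(union)


# "Δ vs base" per eval domain, copied from the Phase 9 table.
DELTAS = {
    "kicad": 0.003,
    "spice": 0.002,
    "stm32": 0.005,
    "emc": -0.005,
    "embedded": -0.002,
    "platformio": 0.005,
    "freecad": 0.009,
    "dsp": -0.003,
    "iot": -0.018,
    "power": 0.010,
}
IN_DOMAIN = "emc"


def summarize_deltas(deltas: dict[str, float], in_domain: str) -> tuple[float, float]:
    """Return (in-domain Δ, unweighted mean Δ over all other domains)."""
    others = [delta for name, delta in deltas.items() if name != in_domain]
    return deltas[in_domain], sum(others) / len(others)


in_delta, out_mean = summarize_deltas(DELTAS, IN_DOMAIN)
print(f"In-domain Δ: {in_delta:+.3f}")
print(f"Out-of-domain mean Δ: {out_mean:+.3f}")
```

The out-of-domain figure here is an unweighted mean over the nine non-`emc` domains, which matches the reported value after rounding to three decimals. Weighting by per-domain sample count would be an alternative if the eval sets differ in size.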