---
license: apache-2.0
base_model: Qwen/Qwen3-4B-Instruct-2507
library_name: peft
tags:
  - peft
  - lora
  - sft
  - trl
  - qwen3
  - hardware
  - eu-ai-act
  - art-52
  - art-53
  - gpai-fine-tune
language:
  - fr
  - en
pipeline_tag: text-generation
---

# Model Card for qwen3-4b-mascarade-emc-lora

This model is a LoRA adapter fine-tuned from [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).
It was trained using [TRL](https://github.com/huggingface/trl) with supervised fine-tuning (SFT) on an
EMC (electromagnetic compatibility) compliance corpus, as part of the Ailiance mascarade LoRA family.

## Quick start

```python
from transformers import pipeline

question = "What decoupling capacitor strategy minimizes conducted emissions on a switching regulator?"
generator = pipeline("text-generation", model="Ailiance-fr/qwen3-4b-mascarade-emc-lora", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=256, return_full_text=False)[0]
print(output["generated_text"])
```
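
Because this repository is tagged as a PEFT adapter, the base model and the LoRA weights can also be loaded explicitly. The following is a minimal sketch, not the authors' serving code. It assumes the adapter repo contains standard PEFT files and that `peft` and `transformers` are installed. `build_messages` and `generate_with_adapter` are hypothetical helper names introduced here for illustration.

```python
BASE_ID = "Qwen/Qwen3-4B-Instruct-2507"                 # base_model from the card metadata
ADAPTER_ID = "Ailiance-fr/qwen3-4b-mascarade-emc-lora"  # repo id used in the Quick start


def build_messages(question: str) -> list[dict[str, str]]:
    """Wrap a single user question in the chat-message format."""
    return [{"role": "user", "content": question}]


def generate_with_adapter(question: str, max_new_tokens: int = 256) -> str:
    """Load base + LoRA adapter and return a greedy completion (runs on CPU as written)."""
    # Local imports: the heavy libraries are only needed when generation is requested.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
    base = AutoModelForCausalLM.from_pretrained(BASE_ID)
    model = PeftModel.from_pretrained(base, ADAPTER_ID)  # attach the LoRA weights
    model.eval()

    inputs = tokenizer.apply_chat_template(
        build_messages(question),
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    prompt_len = inputs["input_ids"].shape[1]
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

Calling `generate_with_adapter("...")` downloads both repositories on first use. For GPU inference, the model and the input tensors would additionally need to be moved to the CUDA device.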

## Bench results — ailiance-bench Phase 7 (CUDA, 2026-05-11)

Functional eval via the parsers/scorers from [`ailiance/ailiance-bench`](https://github.com/ailiance/ailiance-bench) Phase 1 (`bench_kicad_functional`), ported to CUDA / transformers + PEFT for the Qwen3-4B-Instruct-2507 base.

| Dataset | n | Composite score | Duration |
|---|---:|---:|---:|
| `emc-dsp-power` | 10 | **0.646** | 1221.9s |

_Composite score combines structural-parse-ok, component-count match, ground-node presence, etc. — see `bench_kicad_functional.score_*` for the exact formula. Greedy decoding, max_tokens per `GEN_PARAMS`._

## Upstream base model — official evaluations

These are the **official scores** for the unmodified base model
[`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507),
as reported by the Alibaba Qwen team. They are scores of the base model, not of this
adapter, and indicate the general capability that this LoRA inherits before the
hardware-domain fine-tune adapts its behavior.

| Category      | Benchmark               | Qwen3-4B-Instruct-2507 |
|---------------|-------------------------|-----------------------:|
| Knowledge     | MMLU-Pro                | 69.6                   |
| Knowledge     | MMLU-Redux              | 84.2                   |
| Knowledge     | GPQA                    | 62.0                   |
| Knowledge     | SuperGPQA               | 42.8                   |
| Reasoning     | AIME25                  | 47.4                   |
| Reasoning     | HMMT25                  | 31.0                   |
| Reasoning     | ZebraLogic              | 80.2                   |
| Reasoning     | LiveBench 2024-11-25    | 63.0                   |
| Coding        | LiveCodeBench v6        | 35.1                   |
| Coding        | MultiPL-E               | 76.8                   |
| Coding        | Aider-Polyglot          | 12.9                   |
| Alignment     | IFEval                  | 83.4                   |
| Alignment     | Arena-Hard v2           | 43.4                   |
| Alignment     | Creative Writing v3     | 83.5                   |
| Alignment     | WritingBench            | 83.4                   |
| Agent         | BFCL-v3                 | 61.9                   |
| Agent         | TAU1-Retail             | 48.7                   |
| Agent         | TAU1-Airline            | 32.0                   |
| Agent         | TAU2-Retail             | 40.4                   |
| Multilingual  | MultiIF                 | 69.0                   |
| Multilingual  | MMLU-ProX               | 61.6                   |
| Multilingual  | INCLUDE                 | 60.1                   |
| Multilingual  | PolyMATH                | 31.1                   |

Source: [official Qwen3-4B-Instruct-2507 model card](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).

> **Reading these numbers alongside the Phase 6 bench in this card:** the upstream
> scores measure *general* capability (knowledge, reasoning, coding, alignment).
> The Phase 6 deltas measure *hardware-domain* specialization (KiCad, SPICE,
> schematic extraction). A rank-16 LoRA adapter leaves the base weights frozen and
> adds trainable parameters amounting to less than 1% of the base parameter count,
> so the upstream scores are expected to carry over approximately, with the
> Phase 6 deltas added on top of these inherited capabilities.
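
The "less than 1%" figure can be sanity-checked with a short calculation. The sketch below counts the parameters a rank-16 LoRA adds, under assumed Qwen3-4B layer shapes and two assumed sets of target modules. The shapes, the nominal 4.0e9 base size, and the target lists are illustrative assumptions (the card does not state which modules were adapted) and should be checked against the base model's `config.json` and the adapter's `adapter_config.json`.

```python
# Assumed Qwen3-4B shapes (verify against the base model's config.json).
HIDDEN = 2560
Q_OUT = 32 * 128      # attention heads * head_dim
KV_OUT = 8 * 128      # key/value heads * head_dim
MLP = 9728
LAYERS = 36
BASE_PARAMS = 4.0e9   # nominal "4B" parameter count

# (d_in, d_out) of each linear projection in one transformer layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(targets, rank=16):
    """LoRA adds A (rank x d_in) and B (d_out x rank) for each targeted matrix."""
    per_layer = sum(rank * sum(PROJECTIONS[name]) for name in targets)
    return per_layer * LAYERS


attention_only = lora_params(["q_proj", "k_proj", "v_proj", "o_proj"])
all_linear = lora_params(list(PROJECTIONS))

print(f"attention only: {attention_only:,} ({attention_only / BASE_PARAMS:.2%})")
print(f"all linear:     {all_linear:,} ({all_linear / BASE_PARAMS:.2%})")
```

Under these assumptions both configurations stay below 1% of the base parameter count, which is consistent with the claim above.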

## Training procedure

This model was trained with SFT on an EMC compliance corpus.

### Framework versions

- TRL: 1.4.0
- Transformers: 5.8.0
- PyTorch: 2.11.0
- Datasets: 4.8.5
- Tokenizers: 0.22.2

## Bench results — held-out token-overlap (eval_mascarade_lora, n=10)

Evaluated on 10 random held-out prompts from `Ailiance-fr/mascarade-emc-dataset` (seed=101 ≠ train seed 42).

| Metric | Value |
|---|---:|
| Avg Jaccard token-overlap | **0.06** |
| Avg generation tokens | 143.5 |
| Avg latency (per sample, RTX 4090) | 7.8s |

_Token-overlap is a coarse quality proxy — high overlap (>0.4) suggests the LoRA reproduces domain vocabulary; low overlap indicates either domain-shift or stylistic divergence from the reference. See `ailiance/ailiance-bench` for richer functional evaluations (KiCad DRC, SPICE convergence, etc.) on the same family._
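
For readers unfamiliar with the metric, here is a minimal sketch of Jaccard token-overlap. It assumes lowercase whitespace tokenization. The tokenization actually used by `eval_mascarade_lora` is not documented in this card, so absolute values from this sketch may differ from the table above. The function name and the two example strings are hypothetical.

```python
def jaccard_overlap(prediction: str, reference: str) -> float:
    """Jaccard index |A & B| / |A | B| over the two token sets."""
    # Assumption: tokens are lowercase whitespace-separated words.
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    union = pred_tokens | ref_tokens
    if not union:
        return 0.0  # both strings empty: define the overlap as zero
    return len(pred_tokens & ref_tokens) / len(union)


score = jaccard_overlap(
    "Place a 100 nF capacitor near the regulator input",
    "Add a 100 nF ceramic capacitor at the regulator input pin",
)
print(f"{score:.3f}")
```

Because the metric works on sets, word order and repetition are ignored, which is one reason it is only a coarse proxy.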

## Citations

```bibtex
@software{vonwerra2020trl,
  title   = {{TRL: Transformers Reinforcement Learning}},
  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  license = {Apache-2.0},
  url     = {https://github.com/huggingface/trl},
  year    = {2020}
}
```


## Bench (vs base Qwen3-4B)

Consolidated comparison of this LoRA against a base model, drawing on two complementary evaluation streams. The reference base used for cross-adapter comparison in Phase 6 is `gemma-e4b-eu-kiki-base` (legacy Gemma-4 ancestor), not Qwen3-4B. A dedicated Qwen3-4B-Instruct-2507 baseline run for the Phase 7 functional eval is not in our pipeline yet, so those cells are `n/a`.

### Phase 6 — cross-adapter scoreboard (reference base: `gemma-e4b-eu-kiki-base`)

| Phase | iact-bench task | Base | Tuned (+mascarade) | Δ |
|---|---|--:|--:|--:|
| P3 | kicad-sch-extract (cross-domain) | 0.308 | 0.785 | +0.477 |

### Phase 7 — CUDA functional eval on Qwen3-4B base (production-aligned)

| Dataset | n | Base (Qwen3-4B) | Tuned (this LoRA) | Δ |
|---|--:|--:|--:|--:|
| `emc-dsp-power` | 10 | n/a | **0.646** | n/a |

**Methodology:** iact-bench v0.2.0 (audit-grade Docker validators), greedy decoding, max_tokens per `GEN_PARAMS`. NDJSON audit trail in [`ailiance/ailiance-bench`](https://github.com/ailiance/ailiance-bench). **Scoring date:** 2026-05-11 (commit [`46801af`](https://github.com/ailiance/ailiance-bench/commit/46801af)).

> Phase 6 numbers reflect adapter behavior on a Gemma-4 reference base; domain semantics transfer to the Qwen3-4B production base served via Tower Ollama `:8004`, but absolute scores may shift. A Qwen3-4B baseline run is tracked for a future bench refresh.

## Cross-domain forgetting check (Phase 9, 2026-05-11)

For each domain's held-out eval set (seed=101, n held-out samples), this check compares this LoRA's Jaccard token-overlap with that of the Qwen3-4B-Instruct-2507 **baseline (no adapter)** on the same prompts. Δ is the LoRA score minus the base score, so a negative Δ means the LoRA scores lower than the base on that domain.

| Eval domain | LoRA Jaccard | Δ vs base |
|---|---:|---:|
| `kicad` | 0.09 | +0.003 |
| `spice` | 0.007 | +0.002 |
| `stm32` | 0.055 | +0.005 |
| `emc` | 0.061 | -0.005 ⬅ in-domain |
| `embedded` | 0.072 | -0.002 |
| `platformio` | 0.047 | +0.005 |
| `freecad` | 0.03 | +0.009 |
| `dsp` | 0.098 | -0.003 |
| `iot` | 0.05 | -0.018 |
| `power` | 0.078 | +0.010 |

**In-domain Δ**: -0.005  **Out-of-domain mean Δ**: +0.001
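
The two summary figures can be recomputed from the Δ column of the table. The sketch below does so with the rounded values as printed, so the result only approximates what the unrounded per-domain scores would give. The variable names are illustrative.

```python
from statistics import mean

IN_DOMAIN = "emc"

# "Δ vs base" column, copied from the table above.
DELTAS = {
    "kicad": 0.003,
    "spice": 0.002,
    "stm32": 0.005,
    "emc": -0.005,
    "embedded": -0.002,
    "platformio": 0.005,
    "freecad": 0.009,
    "dsp": -0.003,
    "iot": -0.018,
    "power": 0.010,
}

in_domain_delta = DELTAS[IN_DOMAIN]
out_of_domain_mean = mean(v for k, v in DELTAS.items() if k != IN_DOMAIN)
# Domains where the LoRA scores below the base (negative delta).
degraded = sorted(k for k, v in DELTAS.items() if v < 0)

print(f"in-domain delta:          {in_domain_delta:+.3f}")
print(f"out-of-domain mean delta: {out_of_domain_mean:+.3f}")
print("negative-delta domains:  ", degraded)
```

The out-of-domain mean is close to zero mainly because the `iot` drop offsets several small gains, so the per-domain rows are worth reading alongside the average.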