Gemma-4 E4B Audio — v3 (Stage 2 DoRA, merged)
Single-stage DoRA fine-tune of google/gemma-4-e4b-it on audio-QA data.
Merged adapter, no external PEFT runtime required. Beats the base model by
+3.8pp on MMAU test_mini (1000 samples, same pipeline).
The primary contribution is a negative finding: two-stage recipes that first re-train the audio encoder on ~100k mixed-domain captions are worse than doing nothing to the encoder and only DoRA-adapting on task-specific audio QA. See "Ablation" below.
Results (MMAU test_mini, 1000 samples, identical pipeline)
| Model | Overall |
|---|---|
| Base google/gemma-4-e4b-it | 55.0% |
| This model (v3 @ step 200) | 58.8% |
Per-subcategory Δ (this model − base)
Wins:
| Subcategory | Base | v3 | Δ |
|---|---|---|---|
| Emotion State Summarisation | 45.0% | 60.0% | +15.0 |
| Event-Based Knowledge Retrieval | 75.0% | 90.0% | +15.0 |
| Ambient Sound Interpretation | 14.6% | 27.1% | +12.5 |
| Acoustic Scene Reasoning | 41.7% | 52.1% | +10.4 |
| Phonemic Stress Pattern Analysis | 33.3% | 43.1% | +9.8 |
| Event-Based Sound Reasoning | 66.7% | 75.0% | +8.3 |
| Temporal Event Reasoning | 27.1% | 33.3% | +6.2 |
| Emotional Tone Interpretation | 69.7% | 75.8% | +6.1 |
| Dissonant Emotion Interpretation | 90.0% | 95.0% | +5.0 |
| Socio-cultural Interpretation | 80.0% | 85.0% | +5.0 |
Neutral (within ±3pp): Conversational Fact Retrieval, Counting, Harmony, Lyrical Reasoning, Multi-Speaker Role Mapping, Sound-Based Event Recognition, Musical Genre Reasoning, Rhythm and Tempo, Acoustic Source Inference.
Regressions (disclosed honestly):
| Subcategory | Base | v3 | Δ |
|---|---|---|---|
| Instrumentation | 60.0% | 57.1% | −2.9 |
| Musical Texture Interpretation | 73.5% | 70.6% | −2.9 |
| Eco-Acoustic Knowledge | 97.9% | 93.6% | −4.3 |
| Key Highlight Extraction | 100.0% | 95.0% | −5.0 |
| Phonological Sequence Decoding | 60.0% | 55.0% | −5.0 |
| Melodic Structure Interpretation | 51.5% | 45.5% | −6.0 |
| Emotion Flip Detection | 40.0% | 25.0% | −15.0 |
Some bins have only n=20 samples, where a single answer moves the score by 5pp, so deltas in those bins should be read as suggestive rather than conclusive.
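To make the small-bin caveat concrete, the sketch below computes 95% Wilson score intervals for a 20-sample bin. It assumes Emotion State Summarisation has n=20 (consistent with its scores being multiples of 5%, but not stated above), i.e. 9/20 correct for base and 12/20 for v3. The helper name `wilson_interval` is illustrative.

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Assumed counts: 45.0% and 60.0% of an n=20 bin.
base_lo, base_hi = wilson_interval(9, 20)
v3_lo, v3_hi = wilson_interval(12, 20)

print(f"base: {base_lo:.1%} to {base_hi:.1%}")
print(f"v3:   {v3_lo:.1%} to {v3_hi:.1%}")
print("intervals overlap:", v3_lo < base_hi)
```

Under these assumptions each interval spans roughly 40 percentage points and the two overlap, which is why a +15pp delta in such a bin is directional only.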
Ablation: why single-stage beats two-stage here
An earlier internal version (v2) used a two-stage recipe: Stage 1 retrained the audio encoder on ~100k audio-caption pairs mixing Clotho, FSD50K, MusicBench, MusicQA, and LibriSpeech, with five generic captioning prompts rotated randomly across sources; Stage 2 was DoRA on the LM for audio QA. v2 scored below base on MMAU, with Phonemic Stress collapsing by ~40pp.
Reading the Qwen2-Audio paper and Google's Gemma docs gave two diagnoses:
- Generic shared prompts cause "one-to-many interference" across heterogeneous sources (Qwen-Audio v1 → v2 is a direct case study; v2 abandoned unified tags for natural-language task-specific prompts).
- 100k examples (≈ few hundred hours) is 2–3 orders of magnitude below the ~150k hours Qwen2-Audio used to successfully retrain a Whisper-family audio encoder. At this scale, encoder fine-tuning is under-determined and biases toward whichever domain has the cleanest labels (for us: LibriSpeech transcription), collapsing the rest.
v3 (this model) skips the encoder-retrain stage entirely. The audio encoder enters training at its pretrained Google state, and the only adaptation is the DoRA applied during the audio-QA pass. Phonemic Stress Pattern Analysis improves by +9.8pp instead of collapsing, which points to under-resourced encoder fine-tuning on mixed-domain captions as the harmful factor.
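The first diagnosis contrasts two prompting schemes. The sketch below shows the difference in miniature. All prompt strings and helper names are hypothetical (the actual v2 and v3 prompts are not given here); only the source names come from the text above.

```python
import random

SOURCES = ["Clotho", "FSD50K", "MusicBench", "MusicQA", "LibriSpeech"]

# Hypothetical generic prompts, standing in for the five rotated in v2.
GENERIC_PROMPTS = [
    "Describe this audio.",
    "What do you hear?",
    "Caption the clip.",
    "Summarise the recording.",
    "Explain what is happening in the audio.",
]

# Hypothetical task-specific prompts, one fixed instruction per source.
TASK_PROMPTS = {
    "Clotho": "Describe the environmental sounds in this clip.",
    "FSD50K": "Name the sound events present in this clip.",
    "MusicBench": "Describe the musical style and instrumentation.",
    "MusicQA": "Answer the question about this piece of music.",
    "LibriSpeech": "Transcribe the speech in this recording.",
}

def generic_prompt(source: str, rng: random.Random) -> str:
    """Generic scheme: the prompt ignores the source, so one instruction
    ends up paired with very different target styles."""
    return rng.choice(GENERIC_PROMPTS)

def task_prompt(source: str) -> str:
    """Task-specific scheme: each source always gets its own instruction."""
    return TASK_PROMPTS[source]

rng = random.Random(0)
for source in SOURCES:
    print(f"{source:12s} generic={generic_prompt(source, rng)!r}")
    print(f"{source:12s} task   ={task_prompt(source)!r}")
```

With the generic scheme, the same instruction can precede a transcription target for one clip and a sound-event caption for another; the task-specific mapping removes that ambiguity.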
Training
- Base: google/gemma-4-e4b-it
- Adapter: DoRA, rank 32, α=64, dropout 0.05, target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. Note: the target regex matched the audio encoder and vision tower layers in addition to the LM; this was wider than strictly intended, but did not hurt results.
- Data: bnovikov/gemma-4-e4b-audio-qa, 91k audio-QA rows from ClothoAQA, MusicQA, LibriSpeech-QA, and FSD50K-QA, with task-specific natural-language prompts per source.
- Steps: 200 optimizer steps (batch 4 × grad_accum 8 → effective 32), i.e. about 6,400 examples seen, roughly 7% of the 91k rows.
- LR / schedule: 1e-5 cosine with warmup, bf16, gradient checkpointing.
- Hardware: single RTX 4090 24 GB. ~1 hour of GPU time.
- Note on step count: checkpoint-200 is shipped because the run hit a VRAM fragmentation OOM at the next save. This is the amount of training actually completed, not a tuned stopping point. A longer schedule on a bigger GPU may improve results further or may regress them; MMAU-level probes during training are needed to know.
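The target-module note above can be illustrated with plain regular expressions. This sketch uses made-up module paths (the real Gemma module names may differ) and assumes PEFT's documented behaviour that a list of `target_modules` is matched by name suffix, while a single string is treated as a regex over the full module name. It is not the configuration used for this model.

```python
import re

# Hypothetical module paths; real names in the checkpoint may differ.
MODULES = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.audio_tower.layers.0.self_attn.q_proj",
    "model.vision_tower.layers.0.self_attn.k_proj",
    "model.language_model.layers.0.input_layernorm",
]

PROJECTIONS = ["q_proj", "k_proj", "v_proj", "o_proj",
               "gate_proj", "up_proj", "down_proj"]

def match_by_suffix(name: str) -> bool:
    """Broad matching: any module whose last path component is a target."""
    return name.split(".")[-1] in PROJECTIONS

# Narrower pattern: same projections, but only under the language model.
LM_ONLY = re.compile(r".*language_model.*\.(" + "|".join(PROJECTIONS) + r")")

def match_lm_only(name: str) -> bool:
    """Full-name regex match restricted to language-model submodules."""
    return LM_ONLY.fullmatch(name) is not None

broad = [m for m in MODULES if match_by_suffix(m)]
narrow = [m for m in MODULES if match_lm_only(m)]

print("broad :", broad)
print("narrow:", narrow)
```

The broad rule picks up the audio and vision projections as well, which is the over-matching described in the adapter note; a pattern anchored on the language-model prefix would restrict the adapter to the LM.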
Model size
After merging the adapter, this is a standalone model (no PEFT runtime needed at inference). 8.0B parameters total across the text decoder, audio tower, vision tower, and embeddings. Stored as 4 safetensors shards at ~16 GB total (mostly bf16 with fp32 norms / embeddings).
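As a rough cross-check of these figures, the sketch below assumes every parameter is stored in bf16 (2 bytes), ignores the fp32 norms and embeddings as well as file metadata, and assumes equally sized shards:

```python
PARAMS = 8.0e9          # total parameters, from the text above
BYTES_PER_PARAM = 2     # bf16; assumption: fp32 tensors are ignored
NUM_SHARDS = 4          # assumption: shards are equally sized

total_bytes = PARAMS * BYTES_PER_PARAM
total_gb = total_bytes / 1e9      # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes
per_shard_gb = total_gb / NUM_SHARDS

print(f"{total_gb:.1f} GB ({total_gib:.1f} GiB), ~{per_shard_gb:.1f} GB per shard")
```

The estimate lands on 16 GB, matching the stated on-disk size to within the precision of the assumptions.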
Intended use and limitations
- Audio QA over speech, music, and environmental sounds.
- Not a replacement for a dedicated ASR or captioning model.
- Emotion Flip Detection regressed vs base (−15pp). Avoid using this model for emotion-change-in-time tasks.
- Temporal Reasoning is near random in both base and this model (~10%).
- Subcategory confidence intervals are wide where n=20. Treat single-bin deltas as directional.
License
Subject to the Gemma Terms of Use. Derivative use must comply with Google's terms for the base model.
Citation
If you build on the ablation finding or the audio-QA data mix, please also
cite the four upstream datasets used to construct the training data
(LibriSpeech, ClothoAQA, MusicQA, FSD50K). See bnovikov/gemma-4-e4b-audio-qa
for citation details.