Gemma-4 E4B Audio — v3 (Stage 2 DoRA, merged)

Single-stage DoRA fine-tune of google/gemma-4-e4b-it on audio-QA data. Merged adapter, no external PEFT runtime required. Beats the base model by +3.8pp on MMAU test_mini (1000 samples, same pipeline).

The primary contribution is a negative finding: a two-stage recipe that first retrains the audio encoder on ~100k mixed-domain captions performs worse than skipping that stage and only DoRA-adapting on task-specific audio QA. See "Ablation" below.

Results (MMAU test_mini, 1000 samples, identical pipeline)

Model	Overall
Base `google/gemma-4-e4b-it`	55.0%
This model (v3 @ step 200)	58.8%

Per-subcategory Δ (this model − base)

Wins:

Subcategory	Base	v3	Δ
Emotion State Summarisation	45.0%	60.0%	+15.0
Event-Based Knowledge Retrieval	75.0%	90.0%	+15.0
Ambient Sound Interpretation	14.6%	27.1%	+12.5
Acoustic Scene Reasoning	41.7%	52.1%	+10.4
Phonemic Stress Pattern Analysis	33.3%	43.1%	+9.8
Event-Based Sound Reasoning	66.7%	75.0%	+8.3
Temporal Event Reasoning	27.1%	33.3%	+6.2
Emotional Tone Interpretation	69.7%	75.8%	+6.1
Dissonant Emotion Interpretation	90.0%	95.0%	+5.0
Socio-cultural Interpretation	80.0%	85.0%	+5.0

Neutral (within ±3pp): Conversational Fact Retrieval, Counting, Harmony, Lyrical Reasoning, Multi-Speaker Role Mapping, Sound-Based Event Recognition, Musical Genre Reasoning, Rhythm and Tempo, Acoustic Source Inference.

Regressions (the two −2.9pp rows fall just inside the ±3pp neutral band but are listed anyway):

Subcategory	Base	v3	Δ
Instrumentation	60.0%	57.1%	−2.9
Musical Texture Interpretation	73.5%	70.6%	−2.9
Eco-Acoustic Knowledge	97.9%	93.6%	−4.3
Key Highlight Extraction	100.0%	95.0%	−5.0
Phonological Sequence Decoding	60.0%	55.0%	−5.0
Melodic Structure Interpretation	51.5%	45.5%	−6.0
Emotion Flip Detection	40.0%	25.0%	−15.0

Some bins have n=20, where a single question is worth 5pp: a ±5.0 delta is one question and ±15.0 is three. Read deltas in those bins as suggestive.
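To make the n=20 caveat concrete, here is a minimal sketch that computes Wilson score intervals for one of the small bins. It assumes Emotion State Summarisation has n=20 (so 45.0% and 60.0% correspond to 9/20 and 12/20); the per-row bin sizes are not stated above, so treat those counts as an inference rather than a reported figure.

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumed counts: 9/20 (base, 45.0%) and 12/20 (v3, 60.0%).
base_lo, base_hi = wilson_interval(9, 20)
v3_lo, v3_hi = wilson_interval(12, 20)

print(f"base 45.0%: [{base_lo:.1%}, {base_hi:.1%}]")
print(f"v3   60.0%: [{v3_lo:.1%}, {v3_hi:.1%}]")

# True when the v3 interval starts below the top of the base interval.
intervals_overlap = v3_lo < base_hi
print("intervals overlap:", intervals_overlap)
```

Under these assumptions each interval spans roughly 40 points and the two overlap heavily, which is why a +15.0 delta in a 20-question bin is directional rather than conclusive. Because both models were scored on the same questions, a paired comparison on per-item predictions would be the sharper tool if those are available.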

Ablation: why single-stage beats two-stage here

An earlier internal version (v2) used a two-stage recipe: Stage 1 retrained the audio encoder on ~100k audio-caption pairs mixing Clotho, FSD50K, MusicBench, MusicQA, and LibriSpeech, with five generic captioning prompts rotated randomly across sources; Stage 2 was DoRA on the LM for audio QA. v2 scored below base on MMAU, with Phonemic Stress collapsing by ~40pp.

Reading the Qwen2-Audio paper and Google's Gemma docs gave two diagnoses:

Generic shared prompts cause "one-to-many interference" across heterogeneous sources (Qwen-Audio → Qwen2-Audio is a direct case study; Qwen2-Audio abandoned unified tags for natural-language task-specific prompts).
100k examples (≈ few hundred hours) is 2–3 orders of magnitude below the ~150k hours Qwen2-Audio used to successfully retrain a Whisper-family audio encoder. At this scale, encoder fine-tuning is under-determined and biases toward whichever domain has the cleanest labels (for us: LibriSpeech transcription), collapsing the rest.
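The scale gap in the second diagnosis can be checked with back-of-the-envelope arithmetic. The sketch below converts ~100k clips to hours under a few assumed mean clip lengths (5, 10, and 20 seconds are illustrative guesses, not measured values for this data mix) and compares the result with the ~150k hours quoted above.

```python
import math

N_CLIPS = 100_000            # from the text: ~100k audio-caption pairs
QWEN2_AUDIO_HOURS = 150_000  # figure quoted in the text above

def hours(n_clips: int, mean_clip_seconds: float) -> float:
    """Total audio duration in hours for n_clips of a given mean length."""
    return n_clips * mean_clip_seconds / 3600

def orders_of_magnitude_gap(mean_clip_seconds: float) -> float:
    """log10 of the ratio between the quoted reference hours and our estimate."""
    return math.log10(QWEN2_AUDIO_HOURS / hours(N_CLIPS, mean_clip_seconds))

# Assumed mean clip lengths; the real distribution was not reported.
for secs in (5.0, 10.0, 20.0):
    h = hours(N_CLIPS, secs)
    gap = orders_of_magnitude_gap(secs)
    print(f"{secs:4.0f} s clips: {h:6.0f} h, gap of about 10^{gap:.1f}")
```

For any plausible clip length in that range, the estimate stays in the low hundreds of hours and the gap stays between two and three orders of magnitude, consistent with the claim.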

v3 (this model) skips the encoder-retrain stage entirely. The audio encoder gets no dedicated training pass; the only changes to it are the DoRA updates from the audio-QA pass, because the target regex also matched encoder layers (see "Training"). Phonemic Stress improves by +9.8pp instead of collapsing, which points to under-resourced encoder fine-tuning on mixed-domain captions as the harmful factor.

Training

Base: google/gemma-4-e4b-it
Adapter: DoRA, rank 32, α=64, dropout 0.05, target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. Note: the target regex matched audio encoder and vision tower layers in addition to the LM. This was wider than strictly intended, but the resulting model still beats base overall.
Data: bnovikov/gemma-4-e4b-audio-qa — 91k audio-QA rows from ClothoAQA, MusicQA, LibriSpeech-QA, and FSD50K-QA, each with task-specific natural-language prompts per source.
Steps: 200 optimizer steps (batch 4 × grad_accum 8 → effective batch 32), i.e. about 6,400 examples, roughly 7% of the 91k rows.
LR / schedule: 1e-5 cosine with warmup, bf16, gradient checkpointing.
Hardware: single RTX 4090 24 GB. ~1 hour of GPU time.
Note on step count: checkpoint-200 is shipped because the run hit a VRAM fragmentation OOM at the next save. This is the amount of training actually completed, not a tuned stopping point. A longer schedule on a larger GPU might improve results further or might regress; telling which requires MMAU-style probes during training.
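The adapter note about the over-wide target regex can be illustrated with a small stdlib-only sketch. The module names below are hypothetical and chosen only to show the effect; the real paths should be read from `model.named_modules()` on the actual checkpoint. The sketch also assumes that a string-valued `target_modules` is applied as a full-match regex against each module name, which is my understanding of PEFT's behaviour but is worth confirming against the installed version.

```python
import re

# Hypothetical module names, for illustration only.
MODULE_NAMES = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.audio_tower.layers.0.attn.q_proj",
    "model.vision_tower.blocks.0.attn.k_proj",
    "model.language_model.layers.0.input_layernorm",
]

PROJ = r"(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)"
WIDE = rf".*\.{PROJ}"                         # matches projections in every tower
LM_ONLY = rf".*\.language_model\..*\.{PROJ}"  # restricts to the text decoder

def matched(pattern: str, names: list[str]) -> list[str]:
    """Names the pattern would select, assuming full-match semantics."""
    return [n for n in names if re.fullmatch(pattern, n)]

wide_hits = matched(WIDE, MODULE_NAMES)
lm_hits = matched(LM_ONLY, MODULE_NAMES)

print("wide pattern hits:", len(wide_hits))
for name in wide_hits:
    print("  ", name)
print("LM-only pattern hits:", len(lm_hits))
for name in lm_hits:
    print("  ", name)
```

With these made-up names the wide pattern also picks up the audio and vision projections, while the anchored pattern keeps only the language-model ones. Running the same check over the real module list before training is a cheap way to confirm what an adapter will touch.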

Model size

After merging the adapter, this is a standalone model (no PEFT runtime needed at inference). 8.0B parameters total across the text decoder, audio tower, vision tower, and embeddings. Stored as 4 safetensors shards at ~16 GB total (mostly bf16 with fp32 norms / embeddings).
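The ~16 GB figure follows from the parameter count and dtype. A rough sketch of the arithmetic is below; the 1% fp32 share in the second call is an arbitrary illustrative value, not a measured property of this checkpoint.

```python
def checkpoint_gb(n_params: float, fp32_fraction: float = 0.0) -> float:
    """Approximate on-disk size in GB (1e9 bytes).

    Counts 2 bytes per bf16 parameter and 4 bytes per fp32 parameter,
    ignoring safetensors header overhead.
    """
    bf16_bytes = n_params * (1 - fp32_fraction) * 2
    fp32_bytes = n_params * fp32_fraction * 4
    return (bf16_bytes + fp32_bytes) / 1e9

print(f"all bf16:        {checkpoint_gb(8.0e9):.2f} GB")
print(f"1% fp32 (assumed): {checkpoint_gb(8.0e9, 0.01):.2f} GB")
```

Each parameter kept in fp32 costs two extra bytes, so a small fp32 share moves the total only slightly above 16 GB.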

Intended use and limitations

Audio QA over speech, music, and environmental sounds.
Not a replacement for a dedicated ASR or captioning model.
Emotion Flip Detection regressed vs base (−15pp). Avoid using this model for emotion-change-in-time tasks.
Temporal Event Reasoning remains weak in both base and this model (27.1% → 33.3% in the table above), despite the +6.2pp gain.
Subcategory confidence intervals are wide where n=20. Treat single-bin deltas as directional.

License

Subject to the Gemma Terms of Use. Derivative use must comply with Google's terms for the base model.

Citation

If you build on the ablation finding or the audio-QA data mix, please also cite the four upstream datasets used to construct the training data (LibriSpeech, ClothoAQA, MusicQA, FSD50K). See bnovikov/gemma-4-e4b-audio-qa for citation details.
