Qwen3-ASR-0.6B Streaming (causal audio tower)
A streaming, causal drop-in audio tower for Qwen/Qwen3-ASR-0.6B. The pretrained (offline, fully bidirectional) audio encoder is fine-tuned to run append-only: each ~2 s audio block is encoded exactly once with a causal KV cache and a bounded 15 s attention window, and is never re-encoded. New audio costs one encoder pass over the new block plus an incremental decoder update; per-chunk compute is constant in stream length, memory is bounded, and streams can run indefinitely.
It is built for the minimum-compute-per-chunk regime: many concurrent streams, energy-constrained or on-device serving, and sessions of unbounded length. Per second of audio it uses about 3x less compute than the re-compute streaming backend (42 vs 126 GFLOPs on average), and that cost stays constant as the stream grows. The price is some accuracy (see the results table for the side-by-side comparison).
This repository contains only the fine-tuned audio tower (746 MB fp32 safetensors). The decoder, adapter and feature extractor are loaded unchanged from the base model at runtime.
Results
Long-form: 21 full MCIF/ACL conference talks (5 to 7 minutes each, 2 h total, accented scientific English), human references, Whisper text normalization:
| system | scoring contract | WER (%) | compute per second of audio |
|---|---|---|---|
| Streaming Qwen3-ASR | no rewrite, 250 ms tail cut + EOS flush | 12.6 | 126 GFLOPs avg, growing to 172 within each segment |
| Streaming Qwen3-ASR causal | no rewrite, 250 ms tail cut + EOS flush | 18.1 | 42 GFLOPs, constant |
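As a rough illustration of what the compute column means for concurrent serving, the sketch below converts a sustained compute budget into a number of real-time streams. The per-second costs come from the table above; the budget value and the function name `max_realtime_streams` are hypothetical, and the estimate ignores hardware utilization, batching effects and decoder variance.

```python
# Per-second-of-audio costs from the results table, in GFLOPs.
RECOMPUTE_AVG_GFLOPS = 126.0  # re-compute streaming backend (average)
CAUSAL_GFLOPS = 42.0          # causal tower (constant)


def max_realtime_streams(sustained_gflops_per_s: float, cost_per_audio_s: float) -> int:
    """Upper bound on real-time streams for a sustained compute budget.

    Hypothetical helper: assumes the budget is fully usable, which real
    hardware rarely achieves.
    """
    return int(sustained_gflops_per_s // cost_per_audio_s)


if __name__ == "__main__":
    budget = 10_000.0  # hypothetical sustained GFLOP/s, for illustration only
    print(max_realtime_streams(budget, RECOMPUTE_AVG_GFLOPS))
    print(max_realtime_streams(budget, CAUSAL_GFLOPS))
```

Under these assumptions the same budget carries about three times as many causal streams, which is simply the 126 / 42 ratio from the table.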
Streaming WER contract
Both WER numbers above use the same real-streaming replay: already emitted
text is never rewritten, a 250 ms live-tail cut is applied at every non-final
update, and the stream ends with an end-of-stream flush. When forced-aligner
word_alignments are present, the 250 ms cut uses those word timestamps;
otherwise it falls back to a uniform text approximation.
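One possible reading of the tail cut can be sketched as follows. This is not the evaluation code: the `Word` type, its `end_s` field and the function `visible_text` are hypothetical names, and the uniform fallback here works at word granularity, which may differ from the actual approximation.

```python
from dataclasses import dataclass
from typing import List, Optional

TAIL_S = 0.25  # live-tail cut from the contract above


@dataclass
class Word:
    text: str
    end_s: float  # word end time in seconds (assumed field name)


def visible_text(hyp: str, audio_s: float, final: bool,
                 words: Optional[List[Word]] = None) -> str:
    """Text that counts at this update under the 250 ms tail cut."""
    if final:
        return hyp  # end-of-stream flush: the whole hypothesis counts
    cutoff = audio_s - TAIL_S
    if cutoff <= 0:
        return ""
    if words is not None:
        # Aligned path: keep only words that ended at or before the cutoff.
        return " ".join(w.text for w in words if w.end_s <= cutoff)
    # Fallback: assume words are spread uniformly over the audio heard so far.
    tokens = hyp.split()
    keep = int(len(tokens) * cutoff / audio_s)
    return " ".join(tokens[:keep])
```

With timestamps, a word that ends inside the last 250 ms is withheld until a later update; without them, the fallback drops a proportional share of trailing words.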
How it works
- Block-bidirectional causal execution: full attention within each fixed block (192 mel frames = 1.92 s; a 96-frame variant is also in-regime), causal per-layer KV across blocks, sliding 15 s left window, sinusoidal positions continuing monotonically. Latency = one block.
- Trained by self-distillation: the causal student matches the frozen offline tower's output embeddings (MSE + 0.5·cosine) on LibriSpeech 960 h, audio only, no labels, with mixed block sizes (96/192) and position-offset augmentation (log-uniform up to 6000 steps ≈ 2 h) so positions extrapolate far beyond the 120 s table. About 6 H100-hours total.
- Segmented long-form serving: segments roll at sentence punctuation (min ~12 s, 16 s cap) and the encoder chain resets per segment, which bounds drift; cutting at linguistic boundaries measured better than fixed-length cuts (18.1 vs 18.7 WER).
- The serving stack adds a rolling decoder KV (the [prompt + audio] prefix persists across updates) and lossless speculative re-decoding (the previous hypothesis is verified in one parallel pass), making the decoder side incremental too.
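The block-bidirectional attention pattern from the first bullet can be written as a visibility rule. The sketch below works at mel-frame granularity for readability; the real encoder attends over downsampled tokens, and the exact way the 15 s window is measured (here: distance between query and key frames) is an assumption. `can_attend` is a hypothetical helper, not part of the released code.

```python
BLOCK_FRAMES = 192    # 1.92 s at 100 mel frames per second
WINDOW_FRAMES = 1500  # 15 s left window, expressed in mel frames (assumed unit)


def can_attend(query: int, key: int) -> bool:
    """Whether the frame at index `query` may attend to the frame at `key`."""
    q_block = query // BLOCK_FRAMES
    k_block = key // BLOCK_FRAMES
    if k_block > q_block:
        return False  # future blocks are never visible
    if k_block == q_block:
        return True   # full (bidirectional) attention inside the current block
    # Earlier blocks come from the KV cache, limited to the sliding left window.
    return query - key <= WINDOW_FRAMES
```

Under this rule each query sees at most `WINDOW_FRAMES + BLOCK_FRAMES` keys regardless of how long the stream has been running, which is the property behind the constant per-chunk compute.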
Usage (WhisperLiveKit)
pip install "whisperlivekit[qwen3-streaming]"
wlk --backend qwen3-streaming --language en \
    --qwen3-streaming-audio-backend causal \
    --qwen3-streaming-tower-checkpoint qfuxa/qwen3-asr-0.6b-streaming
The tower downloads automatically; the base model comes from
Qwen/Qwen3-ASR-0.6B. Works on CUDA, Apple Silicon (MPS) and CPU. See
WhisperLiveKit for the WebSocket server, web UI and the OpenAI-compatible
REST endpoint.
Programmatic use:
from whisperlivekit.qwen3_streaming.asr import Qwen3StreamingASR
asr = Qwen3StreamingASR(
    lan="en",
    qwen3_streaming_audio_backend="causal",
    qwen3_streaming_tower_checkpoint="qfuxa/qwen3-asr-0.6b-streaming",
)
streamer = asr.build_streamer("en")
# feed mel chunks of any size; the encoder consumes fixed 1.92 s blocks
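How arbitrary chunk sizes map onto fixed encoder blocks can be illustrated with a small buffer. This is a standalone sketch, not the WhisperLiveKit implementation: `BlockBuffer` and its methods are hypothetical, and frames are treated as opaque list items.

```python
from typing import Any, List

BLOCK_FRAMES = 192  # 1.92 s of mel frames per encoder block


class BlockBuffer:
    """Accumulates mel frames and releases only complete blocks."""

    def __init__(self, block_frames: int = BLOCK_FRAMES) -> None:
        self.block_frames = block_frames
        self._pending: List[Any] = []

    @property
    def pending(self) -> int:
        """Number of frames waiting for the next block to fill."""
        return len(self._pending)

    def push(self, frames: List[Any]) -> List[List[Any]]:
        """Add frames of any count; return every block that became complete."""
        self._pending.extend(frames)
        blocks = []
        while len(self._pending) >= self.block_frames:
            blocks.append(self._pending[:self.block_frames])
            del self._pending[:self.block_frames]
        return blocks
```

Each returned block would be encoded once and then dropped, while leftover frames wait for the next push; this waiting is where the one-block structural latency comes from.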
Limitations
- English only for now. The recipe is language-agnostic and cheap; multilingual versions can follow the same ~6 GPU-hour path.
- Validated for the 0.6B base only (the tower must match the base model).
- Structural latency is one block (~1.9 s; 0.96 s with block_frames=96 at a small quality cost).
- Long-form robustness relies on segment resets at sentence boundaries; the encoder chain is bounded to one segment (~12-16 s) by design.
- Accuracy is below standard streaming Qwen3-ASR under the same no-rewrite contract (18.1 vs 12.6 WER long-form): the model trades accuracy for constant per-chunk compute.
Provenance & license
Derived from Qwen/Qwen3-ASR-0.6B
(Apache 2.0, © Alibaba Cloud). Modifications: the audio tower weights were
fine-tuned (embedding self-distillation, 123k steps total) to support
block-causal streaming execution; everything else is unchanged. Released under
Apache 2.0. Training/eval code, experiment log, and the streaming runtime live
in WhisperLiveKit
(experiments/qwen3-causal/RUNS.md).