Qwen3-ASR-0.6B Streaming (causal audio tower)

A streaming/causal drop-in audio tower for Qwen/Qwen3-ASR-0.6B. It is the pretrained (offline, fully bidirectional) audio encoder, fine-tuned to run append-only: each ~2 s audio block is encoded exactly once with a causal KV cache and a bounded 15 s attention window, and is never re-encoded. New audio costs one encoder pass over the new block plus an incremental decoder update, so per-chunk compute is constant in stream length, memory is bounded, and streams can run indefinitely.

It is built for the minimum-compute-per-chunk regime: many concurrent streams, energy-constrained or on-device serving, and sessions of unbounded length. Per second of audio it uses about 3x less compute than the re-compute streaming backend (42 vs 126 GFLOPs on average), and that cost stays constant as the stream grows. The price is some accuracy (see the results table for the comparison under the same scoring contract).

This repository contains only the fine-tuned audio tower (746 MB fp32 safetensors). The decoder, adapter and feature extractor are loaded unchanged from the base model at runtime.

Results

Long-form: 21 full MCIF/ACL conference talks (5 to 7 minutes each, 2 h total, accented scientific English), human references, Whisper text normalization:

system	scoring contract	WER (%)	compute per second of audio
Streaming Qwen3-ASR	no rewrite, 250 ms tail cut + EOS flush	12.6	126 GFLOPs avg, growing to 172 within each segment
Streaming Qwen3-ASR causal	no rewrite, 250 ms tail cut + EOS flush	18.1	42 GFLOPs, constant
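
The trade-off in the table can be restated as ratios. The short calculation below uses only the numbers from the two rows above; the variable names are illustrative.

```python
# Numbers copied from the results table above.
recompute = {"wer": 12.6, "gflops_avg": 126.0, "gflops_peak": 172.0}
causal = {"wer": 18.1, "gflops": 42.0}

# Compute saved per second of audio (average and worst case within a segment).
compute_ratio_avg = recompute["gflops_avg"] / causal["gflops"]
compute_ratio_peak = recompute["gflops_peak"] / causal["gflops"]

# Accuracy given up, in absolute WER points and relative to the baseline.
wer_gap_abs = causal["wer"] - recompute["wer"]
wer_gap_rel = wer_gap_abs / recompute["wer"]

print(f"compute saving: {compute_ratio_avg:.1f}x avg, {compute_ratio_peak:.1f}x peak")
print(f"WER cost: +{wer_gap_abs:.1f} points ({wer_gap_rel:.0%} relative)")
```

This prints a 3.0x average (4.1x peak) compute saving against a WER cost of 5.5 points, about 44% relative.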

Streaming WER contract

Both WER numbers above are scored with the same real-streaming replay: text that has already been emitted cannot be rewritten, a 250 ms live-tail cut is applied at every non-final update, and the stream ends with an end-of-stream flush. When forced-aligner word_alignments are present, the 250 ms cut uses those word timestamps; otherwise it falls back to a uniform text approximation.
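
The tail cut can be pictured with a minimal sketch. This is not the evaluation code: the function name `live_tail_cut`, the `(word, start_s, end_s)` tuple layout, and the word-level reading of the "uniform text approximation" are all assumptions made for illustration.

```python
def live_tail_cut(text, audio_end_s, word_alignments=None, tail_s=0.25, final=False):
    """Return the part of a hypothesis that counts at one update.

    word_alignments: optional list of (word, start_s, end_s) tuples.
    The tuple layout is an assumption of this sketch.
    """
    if final:
        return text  # end-of-stream flush: the whole hypothesis counts
    cutoff = audio_end_s - tail_s
    if cutoff <= 0:
        return ""  # less than tail_s of audio seen so far
    if word_alignments:
        # Keep only words that finished before the live tail.
        kept = [word for (word, _start, end) in word_alignments if end <= cutoff]
        return " ".join(kept)
    # Fallback (assumed form): spread the words uniformly over the audio seen.
    words = text.split()
    n_keep = int(len(words) * cutoff / audio_end_s)
    return " ".join(words[:n_keep])


align = [("hello", 0.0, 0.4), ("streaming", 0.5, 0.9), ("world", 0.95, 1.0)]
with_timestamps = live_tail_cut("hello streaming world", 1.0, align)
without_timestamps = live_tail_cut("a b c d", 1.0)
flushed = live_tail_cut("a b c d", 1.0, final=True)
```

With timestamps, only words ending at least 250 ms before the current audio end are scored; without them, the same fraction of the text is kept as the fraction of audio that lies before the cut.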

How it works

Block-bidirectional causal execution: full attention within each fixed block (192 mel frames = 1.92 s; the 96-frame block size is also covered by training), causal per-layer KV across blocks, a sliding 15 s left window, and sinusoidal positions that continue monotonically across blocks. Latency = one block.
Trained by self-distillation: the causal student matches the frozen offline tower's output embeddings (MSE + 0.5·cosine) on LibriSpeech 960 h, audio only, no labels, with mixed block sizes (96/192) and position-offset augmentation (log-uniform up to 6000 steps ≈ 2 h) so positions extrapolate far beyond the 120 s table. About 6 H100-hours total.
Segmented long-form serving: segments roll at sentence punctuation (min ~12 s, 16 s cap) and the encoder chain resets per segment, which bounds drift; cutting at linguistic boundaries measured better than fixed-length cuts (18.1 vs 18.7 WER).
The serving stack adds a rolling decoder KV (the [prompt + audio] prefix persists across updates) and lossless speculative re-decoding (the previous hypothesis is verified in one parallel pass), making the decoder side incremental too.
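
Two pieces of the description above can be sketched in a few lines of plain Python: which cached blocks a new block attends to, and the shape of the distillation loss. Both functions are hypothetical illustrations, not the training or runtime code. The sketch assumes the 15 s window is counted in whole blocks and includes the current block, and it reads the "cosine" term as 1 minus cosine similarity.

```python
import math

FRAME_S = 0.01  # one mel frame = 10 ms, so 192 frames = 1.92 s


def visible_blocks(block_idx, block_frames=192, window_s=15.0):
    """Indices of the blocks whose keys/values block `block_idx` attends to.

    Assumption: the window holds only whole blocks, current block included.
    """
    window_frames = round(window_s / FRAME_S)
    max_blocks = max(1, window_frames // block_frames)
    first = max(0, block_idx - max_blocks + 1)
    return range(first, block_idx + 1)


def distill_loss(student, teacher, cos_weight=0.5):
    """MSE + cos_weight * (1 - cosine similarity), averaged over frames.

    student, teacher: equal-length lists of embedding vectors (lists of floats).
    """
    total = 0.0
    for s, t in zip(student, teacher):
        mse = sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        cos = dot / norm if norm > 0 else 0.0
        total += mse + cos_weight * (1.0 - cos)
    return total / len(student)


# The number of attended blocks stops growing once the window is full.
for idx in (0, 3, 10, 1000):
    print(idx, len(visible_blocks(idx)))
```

Under these assumptions a 192-frame block never attends to more than 7 blocks (13.44 s) and a 96-frame block to at most 15 (14.4 s), which is why per-block attention cost stays flat however long the stream runs.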

Usage (WhisperLiveKit)

pip install "whisperlivekit[qwen3-streaming]"

wlk --backend qwen3-streaming --language en \
    --qwen3-streaming-audio-backend causal \
    --qwen3-streaming-tower-checkpoint qfuxa/qwen3-asr-0.6b-streaming

The tower downloads automatically; the base model comes from Qwen/Qwen3-ASR-0.6B. Works on CUDA, Apple Silicon (MPS) and CPU. See WhisperLiveKit for the WebSocket server, web UI and the OpenAI-compatible REST endpoint.

Programmatic use:

from whisperlivekit.qwen3_streaming.asr import Qwen3StreamingASR

asr = Qwen3StreamingASR(
    lan="en",
    qwen3_streaming_audio_backend="causal",
    qwen3_streaming_tower_checkpoint="qfuxa/qwen3-asr-0.6b-streaming",
)
streamer = asr.build_streamer("en")
# feed mel chunks of any size; the encoder consumes fixed 1.92 s blocks
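
The comment above says chunks can be any size while the encoder consumes fixed blocks. One way to picture that re-blocking is the small buffer below. `BlockBuffer` is a hypothetical helper written for this illustration; it is not part of WhisperLiveKit, whose streamer handles this internally.

```python
class BlockBuffer:
    """Accumulate mel frames of any chunk size and release fixed-size blocks."""

    def __init__(self, block_frames=192):
        self.block_frames = block_frames
        self._pending = []

    def push(self, frames):
        """Add frames; return the list of complete blocks now available."""
        self._pending.extend(frames)
        blocks = []
        while len(self._pending) >= self.block_frames:
            blocks.append(self._pending[:self.block_frames])
            del self._pending[:self.block_frames]
        return blocks

    def flush(self):
        """At end of stream, return the remaining partial block (may be empty)."""
        tail, self._pending = self._pending, []
        return tail


buf = BlockBuffer()
emitted = []
for chunk_len in (50, 150, 200):          # arbitrary incoming chunk sizes
    chunk = list(range(chunk_len))        # placeholder frames
    emitted.append(len(buf.push(chunk)))  # complete blocks released per push
leftover = len(buf.flush())
```

Here 400 frames arrive in three uneven chunks; two full 192-frame blocks are released as soon as they are complete, and the 16 remaining frames wait for the end-of-stream flush.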

Limitations

English only for now. The recipe is language-agnostic and cheap; multilingual versions can follow the same ~6 GPU-hour path.
Validated for the 0.6B base only (the tower must match the base model).
Structural latency is one block (~1.9 s; 0.96 s with block_frames=96 at a small quality cost).
Long-form robustness relies on segment resets at sentence boundaries; the encoder chain is bounded to one segment (~12-16 s) by design.
Accuracy is below standard streaming Qwen3-ASR under the same no-rewrite contract (18.1 vs 12.6 WER long-form): you are trading accuracy for ~constant per-chunk compute.
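
The segment policy these limitations depend on (roll at sentence punctuation once a segment is at least ~12 s old, and always by 16 s) can be written as a small decision function. This is a sketch under assumptions: `should_roll`, the punctuation set, and the check on the end of the running hypothesis are illustrative choices, not the runtime's actual logic.

```python
SENTENCE_END = (".", "?", "!")  # assumed punctuation set


def should_roll(segment_age_s, hypothesis, min_s=12.0, cap_s=16.0):
    """Decide whether to close the current segment and reset the encoder chain."""
    if segment_age_s >= cap_s:
        return True  # hard cap: roll even without punctuation
    if segment_age_s >= min_s:
        # Past the minimum length: roll only at a sentence boundary.
        return hypothesis.rstrip().endswith(SENTENCE_END)
    return False  # too short to roll yet


print(should_roll(5.0, "Done."))      # False: below the minimum length
print(should_roll(13.0, "Done."))     # True: long enough and at a boundary
print(should_roll(13.0, "and then"))  # False: waits for punctuation
print(should_roll(16.0, "and then"))  # True: cap reached
```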

Provenance & license

Derived from Qwen/Qwen3-ASR-0.6B (Apache 2.0, © Alibaba Cloud). Modifications: the audio tower weights were fine-tuned (embedding self-distillation, 123k steps total) to support block-causal streaming execution; everything else is unchanged. Released under Apache 2.0. Training/eval code, experiment log, and the streaming runtime live in WhisperLiveKit (experiments/qwen3-causal/RUNS.md).
