Silero VAD v6 (256ms): CoreML for mlx-audio
CoreML conversion of the Silero VAD v6 unified 256ms variant, used as the default VAD pre-processor in mlx-audio to prevent hallucinations in long-form ASR (e.g. Cohere Transcribe on real-world calls with silent intros or music outros).
Variant
- Architecture: Silero v6 unified, 256ms (8 chunks of 32ms per inference)
- Sample rate: 16 kHz
- Input shape: (1, 4160), i.e. 64 context samples + 4096 audio samples
- Output: speech probability (1,1,1) + new hidden_state + new cell_state
- Compute: ALL (ANE preferred, GPU/CPU fallback)
- Size: ~1 MB
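The input length follows from the figures above: 8 chunks of 32 ms at 16 kHz is 8 × 512 = 4096 samples, plus 64 context samples, giving 4160. The sketch below shows one way audio could be framed into inputs of that length. It assumes the 64-sample context is the tail of the previous block (zeros for the first block) and that a trailing partial block is zero-padded. Neither detail is confirmed by this page, and `frame_blocks` is a hypothetical helper, not part of mlx-audio.

```python
SAMPLE_RATE = 16_000
CHUNK = 512                         # 32 ms at 16 kHz
CHUNKS_PER_BLOCK = 8
BLOCK = CHUNK * CHUNKS_PER_BLOCK    # 4096 samples = 256 ms
CONTEXT = 64                        # context samples prepended to each block
INPUT_LEN = CONTEXT + BLOCK         # 4160, matching the (1, 4160) input shape


def frame_blocks(samples):
    """Split mono 16 kHz samples into inputs of length 4160.

    Assumptions (not confirmed by the model card): the context is the last
    64 samples of the previous block, zeros for the first block, and the
    final partial block is zero-padded.
    """
    inputs = []
    context = [0.0] * CONTEXT
    for start in range(0, len(samples), BLOCK):
        block = list(samples[start:start + BLOCK])
        block += [0.0] * (BLOCK - len(block))   # pad the last block
        inputs.append(context + block)
        context = block[-CONTEXT:]              # carry the tail forward
    return inputs


audio = [float(i) for i in range(10_000)]       # stand-in for real audio
blocks = frame_blocks(audio)
print(len(blocks), len(blocks[0]))
```

Each returned list would then be reshaped to (1, 4160) and passed to the model together with the hidden and cell state from the previous call.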
Performance (M-series Apple Silicon)
On 44 min English audio @ 16 kHz:
- Total VAD time: ~2.6 s (1017× real-time)
- Per-block latency: ~250 µs
- Detected 2304 s of speech, 358 s silence
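These figures can be cross-checked with a little arithmetic. The sketch below derives the block count, real-time factor, and per-block latency from the reported totals, assuming one inference per 256 ms block. Because the inputs are rounded, the results should land close to the reported values rather than match them exactly. `vad_stats` is a hypothetical helper written for this check.

```python
import math

BLOCK_SECONDS = 0.256  # one inference covers 256 ms of audio


def vad_stats(audio_seconds, vad_seconds):
    """Derive throughput figures from total audio and total VAD time."""
    blocks = math.ceil(audio_seconds / BLOCK_SECONDS)
    return {
        "blocks": blocks,
        "real_time_factor": audio_seconds / vad_seconds,
        "per_block_us": vad_seconds / blocks * 1e6,
    }


# Speech + silence durations and total VAD time as reported above.
stats = vad_stats(audio_seconds=2304 + 358, vad_seconds=2.6)
print(stats)
```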
Source
Converted from snakers4/silero-vad (MIT) via the conversion tooling in beshkenadze/AudioEnhanceKit.
This is the same silero_vad_v6_256ms_ios16.mlpackage shipped in AEKit's scripts/coreml_output/.
Usage in mlx-audio
from mlx_audio.stt import load
model = load("CohereLabs/cohere-transcribe-03-2026")
out = model.generate("call.wav", language="en", vad=True) # uses this VAD
The model is loaded transparently by cohere_asr.generate(vad=True). No manual
download or path is needed; mlx-audio falls back to this repository when the
bundled copy isn't available, or when the MLX_AUDIO_SILERO_COREML environment
variable points elsewhere.
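To illustrate what a VAD pass can hand to the transcriber, the sketch below merges per-block speech probabilities into time segments. It assumes one probability per 256 ms block and a threshold of 0.5. Both are assumptions made for illustration: the thresholding, padding, and merging rules that mlx-audio actually applies are not described here, and `speech_segments` is a hypothetical helper.

```python
BLOCK_SECONDS = 0.256  # assumed: one probability per 256 ms block


def speech_segments(probs, threshold=0.5, block_seconds=BLOCK_SECONDS):
    """Merge consecutive blocks with prob >= threshold into (start, end) seconds.

    The 0.5 threshold is an illustrative assumption, not mlx-audio's setting.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                                   # speech begins
        elif p < threshold and start is not None:
            segments.append((start * block_seconds, i * block_seconds))
            start = None                                # speech ends
    if start is not None:                               # speech runs to the end
        segments.append((start * block_seconds, len(probs) * block_seconds))
    return segments


probs = [0.1, 0.9, 0.8, 0.2, 0.7]   # made-up probabilities for illustration
segments = speech_segments(probs)
print(segments)
```

Transcribing only the audio inside such segments is one way to keep silent intros and music outros away from the ASR model.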
License
The original Silero VAD model is released under the MIT License. This conversion is published under the same MIT license.