Silero VAD v6 (256ms) — CoreML for mlx-audio

CoreML conversion of the Silero VAD v6 unified 256 ms variant, used as the default VAD pre-processor in mlx-audio to prevent hallucinations in long-form ASR (e.g. Cohere Transcribe on real-world calls with silent intros or music outros).

Variant

Architecture: Silero v6 unified, 256ms (8 chunks of 32ms per inference)
Sample rate: 16 kHz
Input shape: (1, 4160), i.e. 64-sample context + 4096 audio samples
Output: speech probability (1,1,1) + new hidden_state + new cell_state
Compute: ALL (ANE preferred, GPU/CPU fallback)
Size: ~1 MB
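
The input layout above can be illustrated with a minimal framing sketch. It assumes the context is the last 64 samples of the previous block, that the first block uses a zero context, and that a short final block is zero-padded; none of these details are stated in this card, and `frame_blocks` is a hypothetical helper, not part of mlx-audio.

```python
CONTEXT = 64    # samples carried over from the previous block
BLOCK = 4096    # 8 chunks x 512 samples (32 ms each at 16 kHz)


def frame_blocks(samples):
    """Yield 4160-sample model inputs from a sequence of 16 kHz mono samples."""
    context = [0.0] * CONTEXT                  # assumption: zero context at the start
    for start in range(0, len(samples), BLOCK):
        block = list(samples[start:start + BLOCK])
        block += [0.0] * (BLOCK - len(block))  # assumption: zero-pad the last block
        yield context + block                  # 64 + 4096 = 4160 samples
        context = block[-CONTEXT:]             # tail of this block feeds the next one


frames = list(frame_blocks([0.1] * 10000))
print(len(frames), len(frames[0]))
```

Each yielded frame would be reshaped to (1, 4160) before inference, with the returned hidden_state and cell_state passed into the next call.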

Performance (M-series Apple Silicon)

On 44 min of English audio at 16 kHz:

Total VAD time: ~2.6 s (1017× real-time)
Per-block latency: ~250 µs
Detected 2304 s of speech, 358 s of silence
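
As a rough consistency check, the figures above can be related to each other with simple arithmetic. The sketch below assumes one inference per 4096-sample block and treats the quoted latency as exact, so the results only match the reported numbers to within rounding.

```python
import math

SAMPLE_RATE = 16_000
BLOCK_SAMPLES = 4096
PER_BLOCK_S = 250e-6                     # quoted per-block latency (rounded)

audio_s = 2304 + 358                     # speech + silence, about 44 min in total
block_s = BLOCK_SAMPLES / SAMPLE_RATE    # 0.256 s of audio per inference
n_blocks = math.ceil(audio_s / block_s)  # number of inferences over the file
vad_time_s = n_blocks * PER_BLOCK_S      # estimated total VAD time
speedup = audio_s / vad_time_s           # audio duration / processing time

print(f"{n_blocks} blocks, {vad_time_s:.2f} s, {speedup:.0f}x real-time")
```

The estimate lands at roughly 2.6 s and on the order of 1000× real-time, in line with the measured totals.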

Source

Converted from snakers4/silero-vad (MIT) via the conversion tooling in beshkenadze/AudioEnhanceKit. This is the same silero_vad_v6_256ms_ios16.mlpackage shipped in AEKit's scripts/coreml_output/.

Usage in mlx-audio

from mlx_audio.stt import load
model = load("CohereLabs/cohere-transcribe-03-2026")
out = model.generate("call.wav", language="en", vad=True)  # uses this VAD

The model is loaded transparently by cohere_asr.generate(vad=True). No manual download or path is needed; mlx-audio falls back to this repository when the bundled copy isn't available, or when the MLX_AUDIO_SILERO_COREML environment variable points elsewhere.
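
The environment variable mentioned above could be set from Python before transcription. This is a minimal sketch that assumes the variable holds a filesystem path to the .mlpackage and is read when generate(vad=True) runs; this card confirms neither, and the path is a placeholder.

```python
import os

# Placeholder path; assumed (not confirmed) to be read by mlx-audio when vad=True.
os.environ["MLX_AUDIO_SILERO_COREML"] = "/path/to/silero_vad_v6_256ms_ios16.mlpackage"

print(os.environ["MLX_AUDIO_SILERO_COREML"])
```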

License

The original Silero VAD model is released under the MIT License. This conversion is published under the same MIT license.
