Silero VAD v6 (256ms): CoreML for mlx-audio

CoreML conversion of the Silero VAD v6 unified 256ms variant, used as the default VAD pre-processor in mlx-audio to prevent hallucinations on long-form ASR (e.g. Cohere Transcribe on real-world calls with silent intros or music outros).
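As a rough illustration of how a VAD pre-processor can keep silence and music away from an ASR model, the sketch below turns per-block speech probabilities into time spans. The `speech_segments` helper, the 0.5 threshold, and the sample probabilities are assumptions made for this example; they are not taken from mlx-audio or Silero.

```python
BLOCK_SECONDS = 0.256  # one probability per 256 ms block, per this model card


def speech_segments(probs, threshold=0.5, block_seconds=BLOCK_SECONDS):
    """Merge consecutive blocks with prob >= threshold into (start, end) spans in seconds.

    Hypothetical helper: the threshold and merging rule are illustrative assumptions.
    """
    segments = []
    start = None  # index of the first block of the currently open speech span
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # speech begins
        elif p < threshold and start is not None:
            segments.append((start * block_seconds, i * block_seconds))
            start = None  # speech ends
    if start is not None:  # audio ended while still in speech
        segments.append((start * block_seconds, len(probs) * block_seconds))
    return segments


# Made-up probabilities: silence, speech, silence, speech
probs = [0.02, 0.01, 0.91, 0.97, 0.88, 0.05, 0.03, 0.76, 0.81]
for begin, end in speech_segments(probs):
    print(f"speech from {begin:.3f} s to {end:.3f} s")
```

Only the spans returned here would be passed on to transcription, so a silent intro or music outro never reaches the ASR decoder. A production pipeline would likely also apply padding and minimum-duration rules, which are omitted here.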

Variant

  • Architecture: Silero v6 unified, 256ms (8 chunks of 32ms per inference)
  • Sample rate: 16 kHz
  • Input shape: (1, 4160), i.e. 64-sample context + 4096 audio samples
  • Output: speech probability (1,1,1) + new hidden_state + new cell_state
  • Compute: ALL (ANE preferred, GPU/CPU fallback)
  • Size: ~1 MB
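
The shape figures above follow from the sample rate and chunk size: 32 ms at 16 kHz is 512 samples, 8 chunks make 4096, and the 64-sample context brings the input to 4160. The sketch below checks that arithmetic and frames a signal into model-sized inputs. How the context is filled (zeros for the first block, then the tail of the previous block) and the zero-padding of the last block are assumptions for illustration, and the recurrent hidden_state / cell_state handling is left out.

```python
SAMPLE_RATE = 16_000    # Hz
CHUNK_MS = 32           # duration of one chunk
CHUNKS_PER_CALL = 8     # chunks consumed per inference
CONTEXT = 64            # context samples prepended to each block

CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000     # 512
BLOCK_SAMPLES = CHUNK_SAMPLES * CHUNKS_PER_CALL    # 4096
INPUT_SAMPLES = CONTEXT + BLOCK_SAMPLES            # 4160


def frame_blocks(samples):
    """Yield lists of 4160 floats: 64 context samples followed by 4096 new samples."""
    context = [0.0] * CONTEXT  # assumption: zero context before the first block
    for start in range(0, len(samples), BLOCK_SAMPLES):
        block = list(samples[start:start + BLOCK_SAMPLES])
        block += [0.0] * (BLOCK_SAMPLES - len(block))  # assumption: zero-pad the last block
        yield context + block
        context = block[-CONTEXT:]  # assumption: context is the tail of the previous block


audio = [0.0] * 10_000  # placeholder signal, 0.625 s at 16 kHz
frames = list(frame_blocks(audio))
print(len(frames), "blocks of", len(frames[0]), "samples")
```

Each yielded list would still need to be reshaped to (1, 4160) and paired with the previous recurrent state before being passed to the model.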

Performance (M-series Apple Silicon)

On 44 min English audio @ 16 kHz:

  • Total VAD time: ~2.6 s (1017× real-time)
  • Per-block latency: ~250 µs
  • Detected 2304 s of speech, 358 s of silence
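
The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check built from the rounded numbers in this section, assuming one inference per 256 ms block, so the results will match the reported speed-up and latency approximately rather than exactly.

```python
import math

speech_s = 2304      # detected speech, seconds (from the list above)
silence_s = 358      # detected silence, seconds
vad_time_s = 2.6     # reported total VAD time, rounded
block_s = 0.256      # audio covered by one inference (assumption: no overlap)

audio_s = speech_s + silence_s                 # total audio duration
blocks = math.ceil(audio_s / block_s)          # number of inference calls
real_time_factor = audio_s / vad_time_s        # how many times faster than real time
per_block_us = vad_time_s / blocks * 1e6       # mean latency per call, microseconds

print(f"audio duration   : {audio_s} s ({audio_s / 60:.1f} min)")
print(f"inference calls  : {blocks}")
print(f"real-time factor : {real_time_factor:.0f}x")
print(f"per-block latency: {per_block_us:.0f} us")
```

The derived values land close to the reported ~1017× and ~250 µs; the small gap is expected because the 2.6 s total is itself rounded.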

Source

Converted from snakers4/silero-vad (MIT) via the conversion tooling in beshkenadze/AudioEnhanceKit. This is the same silero_vad_v6_256ms_ios16.mlpackage shipped in AEKit's scripts/coreml_output/.

Usage in mlx-audio

from mlx_audio.stt import load
model = load("CohereLabs/cohere-transcribe-03-2026")
out = model.generate("call.wav", language="en", vad=True)  # uses this VAD

The model is loaded transparently by cohere_asr.generate(vad=True). No manual download or path is needed; mlx-audio falls back to this repository when the bundled copy isn't available, or when the MLX_AUDIO_SILERO_COREML environment variable points elsewhere.

License

The original Silero VAD model is released under the MIT License. This conversion is published under the same MIT license.
