---
license: mit
library_name: onnxruntime
tags:
- onnx
- int8
- quantized
- speech-recognition
- asr
- streaming
- moonshine
- encoder-decoder
pipeline_tag: automatic-speech-recognition
base_model: UsefulSensors/moonshine-streaming-medium
datasets:
- librispeech_asr
language:
- en
---

# Moonshine Streaming Medium — ONNX INT8

**Dynamic INT8 quantized** ONNX export of [UsefulSensors/moonshine-streaming-medium](https://huggingface.co/UsefulSensors/moonshine-streaming-medium) for fast CPU inference with ONNX Runtime.

## Model Overview

Moonshine v2 Streaming is an encoder-decoder ASR model designed for **real-time streaming speech recognition**. The encoder uses causal sliding-window attention (no positional embeddings), which lets it process audio incrementally. The decoder uses RoPE-based causal attention with cross-attention to the encoder states.

| | Details |
|---|---|
| **Architecture** | Encoder-Decoder Transformer (streaming) |
| **Parameters** | ~330M (FP32) |
| **Encoder** | 14 layers, 768-dim, 10 heads, sliding-window attention |
| **Decoder** | 14 layers, 640-dim, 10 heads, RoPE + cross-attention |
| **Vocab** | 32,768 BPE tokens |
| **Audio Input** | 16 kHz mono, 5 ms frames (80 samples) |
| **Quantization** | Dynamic INT8 (weight-only, symmetric) |
| **Latency** | Real-time capable on modern CPUs |

## Files

| File | Size | Description |
|------|------|-------------|
| `encoder_model_int8.onnx` | 135 MB | Audio → 768-dim encoder hidden states |
| `decoder_model_int8.onnx` | 225 MB | First decode step (initializes KV cache) |
| `decoder_with_past_model_int8.onnx` | 202 MB | Subsequent decode steps (streaming KV cache) |
| `config.json` | — | Model architecture configuration |
| `tokenizer.json` | 3.6 MB | BPE tokenizer (32,768 vocab) |
| `processor_config.json` | — | Audio processor settings |
| `tokenizer_config.json` | — | Tokenizer metadata |
| **Total** | **~562 MB** | **~64% smaller than FP32 (~1.57 GB)** |

## Quick Start

### Installation

```bash
pip install onnxruntime numpy tokenizers
```

### Basic Inference

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "moonshine-streaming-medium-onnx"
BOS, EOS = 1, 2      # special token IDs
MAX_TOKENS = 256     # hard cap on generated tokens

# Load models
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
providers = ["CPUExecutionProvider"]

encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

# Encode audio (16 kHz mono float32); random noise stands in for real speech here
audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second
mask = np.ones((1, 16000), dtype=np.int64)
enc_out = encoder.run(None, {"input_values": audio, "attention_mask": mask})[0]

# Names of the KV cache inputs expected by the decoder-with-past graph
dec_past_in_names = {
    i.name for i in decoder_past.get_inputs()
    if i.name not in ("decoder_input_ids", "encoder_hidden_states")
}


def to_past_inputs(session, outputs):
    """Map a decoder's `present_*` cache outputs to the matching past-input names."""
    out_names = [o.name for o in session.get_outputs()][1:]  # skip logits
    kv = {}
    for name, tensor in zip(out_names, outputs[1:]):
        past_name = name.replace("present_", "past_", 1)
        if past_name in dec_past_in_names:
            kv[past_name] = tensor
        elif name + "_orig" in dec_past_in_names:
            kv[name + "_orig"] = tensor
        else:
            kv[name] = tensor
    return kv


# First decode step: feed BOS, get the first token and the initial KV cache
bos = np.array([[BOS]], dtype=np.int64)
first_out = decoder.run(None, {"decoder_input_ids": bos, "encoder_hidden_states": enc_out})
token_id = int(np.argmax(first_out[0][0, -1, :]))
kv = to_past_inputs(decoder, first_out)

# Greedy autoregressive decoding
tokens = [token_id]
while token_id != EOS and len(tokens) < MAX_TOKENS:
    inputs = {
        "decoder_input_ids": np.array([[token_id]], dtype=np.int64),
        "encoder_hidden_states": enc_out,
    }
    inputs.update(kv)
    past_out = decoder_past.run(None, inputs)
    token_id = int(np.argmax(past_out[0][0, -1, :]))
    tokens.append(token_id)

    # Update KV cache; update() keeps any entries the graph does not re-emit
    kv.update(to_past_inputs(decoder_past, past_out))

text = tokenizer.decode(tokens)
print(text)
```

### Real-Time Microphone Streaming

The companion CLI tool provides real-time streaming ASR with voice activity detection:

```bash
pip install sounddevice
python inference_moonshine.py --model-dir moonshine_streaming_medium
```

## Architecture Details

### Three-Model Design

The model is split into three ONNX graphs for efficient streaming:

1. **Encoder** — Processes raw audio with causal stride-2 convolutions and sliding-window attention. Outputs 768-dim hidden states at 50 Hz (one frame per 20 ms of audio).
2. **Decoder (first step)** — Takes the BOS token and the encoder states, produces the first token logits, and initializes 56 KV cache tensors (14 layers × 2 attention types × key+value).
3. **Decoder with past** — Takes the previous token, the encoder states, and the KV cache, and produces the next token logits and an updated cache. The self-attention KV grows by one position each step; the cross-attention KV stays constant.
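
The figures quoted in the list above can be checked with a little arithmetic. The following is a minimal sketch, not part of the export: `kv_tensor_count` and `approx_encoder_frames` are hypothetical helper names, and the frame count is only an approximation that assumes the stated 50 Hz output rate and ignores convolution padding and edge effects.

```python
def kv_tensor_count(num_layers: int) -> int:
    """Cache tensors per decoder: layers x attention types x (key, value)."""
    attention_types = 2   # self-attention and cross-attention
    tensors_per_type = 2  # one key tensor and one value tensor
    return num_layers * attention_types * tensors_per_type


def approx_encoder_frames(num_samples: int, sample_rate: int = 16_000,
                          frame_rate_hz: int = 50) -> int:
    """Rough encoder output length (assumption: exactly one frame per 20 ms)."""
    return num_samples * frame_rate_hz // sample_rate


print(kv_tensor_count(14))            # 56 cache tensors for the 14-layer decoder
print(approx_encoder_frames(16_000))  # 50 frames for one second of 16 kHz audio
```

With a session loaded as in the Quick Start, the first number could be compared against `len(decoder.get_outputs()) - 1`, since every decoder output after the logits is a cache tensor.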

### Sliding Window Attention

The encoder uses per-layer sliding window sizes for streaming efficiency:

| Layers | Window | Lookahead | Purpose |
|--------|--------|-----------|---------|
| 0–1 | 16 frames | 4 frames | Initial context with lookahead |
| 2–11 | 16 frames | 0 frames | Causal processing (no future) |
| 12–13 | 16 frames | 4 frames | Final refinement with lookahead |

### Quantization Details

- **Method**: Dynamic INT8 (weight-only)
- **Target ops**: MatMul, Gemm (transformer compute)
- **Weights**: Symmetric INT8
- **Activations**: Remain FP32 at runtime
- **Audio frontend**: Conv/ConvTranspose kept at full precision
- **Accuracy impact**: Negligible (<0.0001 max absolute encoder diff vs FP32)

## Comparison with Small

| | Small | Medium |
|---|---|---|
| Encoder layers | 10 | 14 |
| Encoder hidden | 620 | 768 |
| Decoder layers | 10 | 14 |
| Decoder hidden | 512 | 640 |
| Attention heads | 8 | 10 |
| INT8 size | ~358 MB | ~562 MB |
| KV tensors | 40 | 56 |

## Execution Providers

The models work with any ONNX Runtime execution provider:

| Provider | Platform | Notes |
|----------|----------|-------|
| `CPUExecutionProvider` | All | Default, always available |
| `CoreMLExecutionProvider` | macOS | Hardware-accelerated on Apple Silicon |
| `CUDAExecutionProvider` | Linux/Windows | NVIDIA GPU |
| `DmlExecutionProvider` | Windows | DirectML (DirectX 12 GPU) |

## Export Reproduction

To reproduce this export from the original safetensors:

```bash
pip install "transformers>=5.2.0" "huggingface_hub>=0.23" torch onnx onnxruntime
python export_moonshine_streaming_medium.py
```

Export script: [onnx-creator](https://github.com/subhankardori/onnx-creator)

## License

This model inherits the [MIT License](https://huggingface.co/UsefulSensors/moonshine-streaming-medium) from the original Moonshine model by Useful Sensors.

## Citation

```bibtex
@article{jeffries2024moonshine,
  title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author={Jeffries, Nat and King, Evan and Kudlur, Manjunath and Nicholson, Guy and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2410.15608},
  year={2024}
}
```