---
license: mit
library_name: onnxruntime
tags:
  - onnx
  - int8
  - quantized
  - speech-recognition
  - asr
  - streaming
  - moonshine
  - encoder-decoder
pipeline_tag: automatic-speech-recognition
base_model: UsefulSensors/moonshine-streaming-medium
datasets:
  - librispeech_asr
language:
  - en
---

# Moonshine Streaming Medium — ONNX INT8

**Dynamic INT8 quantized** ONNX export of [UsefulSensors/moonshine-streaming-medium](https://huggingface.co/UsefulSensors/moonshine-streaming-medium) for fast CPU inference with ONNX Runtime.

## Model Overview

Moonshine v2 Streaming is an encoder-decoder ASR model designed for **real-time streaming speech recognition**. The encoder uses causal sliding-window attention (no positional embeddings), enabling it to process audio incrementally. The decoder uses RoPE-based causal attention with cross-attention to encoder states.

| | Details |
|---|---|
| **Architecture** | Encoder-Decoder Transformer (streaming) |
| **Parameters** | ~330M (FP32) |
| **Encoder** | 14 layers, 768-dim, 10 heads, sliding-window attention |
| **Decoder** | 14 layers, 640-dim, 10 heads, RoPE + cross-attention |
| **Vocab** | 32,768 BPE tokens |
| **Audio Input** | 16 kHz mono, 5 ms frames (80 samples) |
| **Quantization** | Dynamic INT8 (weight-only, symmetric) |
| **Latency** | Real-time capable on modern CPUs |

## Files

| File | Size | Description |
|------|------|-------------|
| `encoder_model_int8.onnx` | 135 MB | Audio → 768-dim encoder hidden states |
| `decoder_model_int8.onnx` | 225 MB | First decode step (initializes KV cache) |
| `decoder_with_past_model_int8.onnx` | 202 MB | Subsequent decode steps (streaming KV cache) |
| `config.json` | — | Model architecture configuration |
| `tokenizer.json` | 3.6 MB | BPE tokenizer (32,768 vocab) |
| `processor_config.json` | — | Audio processor settings |
| `tokenizer_config.json` | — | Tokenizer metadata |
| **Total** | **~562 MB** | **~64% smaller than FP32 (~1.57 GB)** |
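
The total and the size reduction in the last row can be checked with a few lines of arithmetic. This is a minimal sketch that copies the sizes from the table and assumes decimal units (1.57 GB taken as 1570 MB); the variable names are illustrative only.

```python
# Sizes in MB, copied from the Files table.
int8_sizes_mb = {
    "encoder_model_int8.onnx": 135,
    "decoder_model_int8.onnx": 225,
    "decoder_with_past_model_int8.onnx": 202,
}
FP32_TOTAL_MB = 1570  # assumption: ~1.57 GB expressed in decimal megabytes

int8_total_mb = sum(int8_sizes_mb.values())
reduction = 1 - int8_total_mb / FP32_TOTAL_MB

print(f"INT8 total: {int8_total_mb} MB")
print(f"Reduction vs FP32: {reduction:.0%}")
```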

## Quick Start

### Installation

```bash
pip install onnxruntime numpy tokenizers
```

### Basic Inference

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

MODEL_DIR = "moonshine-streaming-medium-onnx"

# Load models
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
providers = ["CPUExecutionProvider"]

encoder = ort.InferenceSession(f"{MODEL_DIR}/encoder_model_int8.onnx", opts, providers=providers)
decoder = ort.InferenceSession(f"{MODEL_DIR}/decoder_model_int8.onnx", opts, providers=providers)
decoder_past = ort.InferenceSession(f"{MODEL_DIR}/decoder_with_past_model_int8.onnx", opts, providers=providers)
tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json")

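# Encode audio: 16 kHz mono float32 with shape (1, num_samples).
# Random noise is only a placeholder here; substitute real speech samples.
audio = np.random.randn(1, 16000).astype(np.float32)  # 1 second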
mask = np.ones((1, 16000), dtype=np.int64)
enc_out = encoder.run(None, {"input_values": audio, "attention_mask": mask})[0]

# First decode step (BOS token = 1)
bos = np.array([[1]], dtype=np.int64)
first_out = decoder.run(None, {"decoder_input_ids": bos, "encoder_hidden_states": enc_out})
logits = first_out[0]
token_id = int(np.argmax(logits[0, -1, :]))

# Cache inputs of decoder_with_past: everything except the token and encoder inputs
cache_input_names = {
    i.name for i in decoder_past.get_inputs()
    if i.name not in ("decoder_input_ids", "encoder_hidden_states")
}

def to_cache_inputs(session, outputs):
    """Rename a session's cache outputs ("present_*") to decoder_with_past inputs ("past_*")."""
    mapped = {}
    for out, tensor in zip(session.get_outputs()[1:], outputs[1:]):
        past_name = out.name.replace("present_", "past_", 1)
        if past_name in cache_input_names:
            mapped[past_name] = tensor
        elif out.name + "_orig" in cache_input_names:
            mapped[out.name + "_orig"] = tensor
        else:
            mapped[out.name] = tensor
    return mapped

kv = to_cache_inputs(decoder, first_out)

# Greedy autoregressive decoding
tokens = [token_id]
EOS = 2
MAX_TOKENS = 256
while token_id != EOS and len(tokens) < MAX_TOKENS:
    inputs = {"decoder_input_ids": np.array([[token_id]], dtype=np.int64), "encoder_hidden_states": enc_out}
    inputs.update(kv)
    past_out = decoder_past.run(None, inputs)
    token_id = int(np.argmax(past_out[0][0, -1, :]))
    tokens.append(token_id)
    # Merge instead of replacing, so any cache entry the graph does not
    # re-emit (e.g. constant cross-attention KV) is carried forward.
    kv.update(to_cache_inputs(decoder_past, past_out))

text = tokenizer.decode(tokens)
print(text)
```
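
To run the example on real speech instead of random noise, the audio has to arrive as 16 kHz mono samples. The sketch below reads a 16-bit PCM WAV file with the standard library only. The helper name `load_wav_mono_16k` and the file name `speech.wav` are hypothetical, and scaling samples to the range [-1, 1) is an assumption about what the encoder expects, not something this card specifies.

```python
import sys
import wave
from array import array

def load_wav_mono_16k(path):
    """Read a 16-bit PCM, mono, 16 kHz WAV file as floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        if wav.getframerate() != 16000:
            raise ValueError("expected a 16 kHz sample rate; resample first")
        raw = wav.readframes(wav.getnframes())

    pcm = array("h")          # signed 16-bit integers
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()        # WAV data is little-endian
    return [sample / 32768.0 for sample in pcm]

# Hypothetical usage with the Basic Inference example:
#   samples = load_wav_mono_16k("speech.wav")
#   audio = np.asarray(samples, dtype=np.float32)[None, :]
#   mask = np.ones(audio.shape, dtype=np.int64)
```

Files with a different sample rate or channel count need to be resampled or downmixed first; the helper raises an error rather than guessing.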

### Real-Time Microphone Streaming

The companion CLI tool provides real-time streaming ASR with voice activity detection:

```bash
pip install sounddevice
python inference_moonshine.py --model-dir moonshine_streaming_medium
```

## Architecture Details

### Three-Model Design

The model is split into 3 ONNX graphs for efficient streaming:

1. **Encoder** — Processes raw audio with causal stride-2 convolutions and sliding-window attention. Outputs 768-dim hidden states at 50Hz (one frame per 20ms of audio).

2. **Decoder (first step)** — Takes BOS token + encoder states, produces first token logits and initializes 56 KV cache tensors (14 layers × 2 attention types × key+value).

3. **Decoder with past** — Takes previous token + encoder states + KV cache, produces next token logits and updated cache. Self-attention KV grows each step; cross-attention KV stays constant.
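
The frame rate and cache size quoted above follow from simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the frame count ignores convolution edge effects, so the encoder's actual output length may differ by a frame or two.

```python
SAMPLE_RATE_HZ = 16_000
ENCODER_FRAME_RATE_HZ = 50   # one encoder frame per 20 ms of audio
NUM_DECODER_LAYERS = 14

def encoder_frames(num_samples):
    """Approximate number of encoder frames for a clip (ignores edge effects)."""
    samples_per_frame = SAMPLE_RATE_HZ // ENCODER_FRAME_RATE_HZ  # 320 samples
    return num_samples // samples_per_frame

def kv_tensor_count(num_layers):
    """Self-attention and cross-attention each keep a key and a value tensor per layer."""
    return num_layers * 2 * 2

print(encoder_frames(16_000))               # frames for 1 second of audio
print(kv_tensor_count(NUM_DECODER_LAYERS))  # cache tensors for the medium model
```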

### Sliding Window Attention

The encoder uses per-layer sliding window sizes for streaming efficiency:

| Layers | Window | Lookahead | Purpose |
|--------|--------|-----------|---------|
| 0–1 | 16 frames | 4 frames | Initial context with lookahead |
| 2–11 | 16 frames | 0 frames | Causal processing (no future) |
| 12–13 | 16 frames | 4 frames | Final refinement with lookahead |
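
As a rough illustration of what these window settings mean, the sketch below builds a boolean attention mask for a single layer. It assumes that a window of 16 frames counts the current frame plus 15 earlier ones and that lookahead adds future frames on top of that; the model's actual masking convention may differ, and `sliding_window_mask` is a hypothetical helper, not part of the export.

```python
def sliding_window_mask(num_frames, window, lookahead):
    """mask[q][k] is True when query frame q may attend to key frame k.

    Assumed semantics: each frame sees itself, up to `window - 1`
    earlier frames, and up to `lookahead` later frames.
    """
    return [
        [q - (window - 1) <= k <= q + lookahead for k in range(num_frames)]
        for q in range(num_frames)
    ]

# (window, lookahead) for each of the 14 encoder layers, following the table
LAYER_WINDOWS = [(16, 4)] * 2 + [(16, 0)] * 10 + [(16, 4)] * 2

causal = sliding_window_mask(num_frames=40, window=16, lookahead=0)
with_lookahead = sliding_window_mask(num_frames=40, window=16, lookahead=4)

# Number of frames visible to query frame 20 in each configuration
print(sum(causal[20]), sum(with_lookahead[20]))  # 16 20
```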

### Quantization Details

- **Method**: Dynamic INT8 (weight-only)
- **Target ops**: MatMul, Gemm (transformer compute)
- **Weights**: Symmetric INT8
- **Activations**: Remain FP32 at runtime
- **Audio frontend**: Conv/ConvTranspose kept at full precision
- **Accuracy impact**: Negligible (<0.0001 max absolute encoder diff vs FP32)
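
For readers unfamiliar with symmetric INT8 weights, the idea can be sketched in plain Python: a single scale maps the largest absolute weight to 127, and the zero point is fixed at 0. This is a generic illustration of the scheme under those assumptions, not the kernel ONNX Runtime uses or the export script's code.

```python
def quantize_symmetric_int8(weights):
    """Quantize floats to INT8 with one symmetric scale (zero point is 0)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map INT8 values back to approximate float weights."""
    return [v * scale for v in quantized]

weights = [0.50, -1.27, 0.013, 0.9]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)        # integer codes in [-127, 127]
print(max_err)  # rounding error, at most half of `scale`
```

The rounding error per weight is bounded by half the scale, which is why tensors with a small dynamic range lose very little precision.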

## Comparison with Small

| | Small | Medium |
|---|---|---|
| Encoder layers | 10 | 14 |
| Encoder hidden | 620 | 768 |
| Decoder layers | 10 | 14 |
| Decoder hidden | 512 | 640 |
| Attention heads | 8 | 10 |
| INT8 size | ~358 MB | ~562 MB |
| KV tensors | 40 | 56 |

## Execution Providers

The models should run on any execution provider included in your installed `onnxruntime` build. Common choices:

| Provider | Platform | Notes |
|----------|----------|-------|
| `CPUExecutionProvider` | All | Default, always available |
| `CoreMLExecutionProvider` | macOS | Hardware-accelerated on Apple Silicon |
| `CUDAExecutionProvider` | Linux/Windows | NVIDIA GPU |
| `DmlExecutionProvider` | Windows | DirectML (DirectX 12 GPU) |
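
Which providers are usable depends on the installed ONNX Runtime build, so it helps to filter a preference list against what is actually available. The sketch below is one way to do that; `pick_providers` and the preference order are hypothetical choices, and note that ONNX Runtime registers the DirectML provider under the name `DmlExecutionProvider`.

```python
PREFERRED = [
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
]

def pick_providers(available, preferred=PREFERRED):
    """Keep preferred providers that are available, with CPU as the final fallback."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen

# With ONNX Runtime installed:
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   encoder = ort.InferenceSession("encoder_model_int8.onnx", providers=providers)
print(pick_providers(["CoreMLExecutionProvider", "CPUExecutionProvider"]))
```

Keeping the CPU provider last lets ONNX Runtime fall back to it for any operator an accelerated provider does not support.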

## Export Reproduction

To reproduce this export from the original safetensors:

```bash
pip install "transformers>=5.2.0" "huggingface_hub>=0.23" torch onnx onnxruntime
python export_moonshine_streaming_medium.py
```

Export script: [onnx-creator](https://github.com/subhankardori/onnx-creator)

## License

This model inherits the [MIT License](https://huggingface.co/UsefulSensors/moonshine-streaming-medium) from the original Moonshine model by Useful Sensors.

## Citation

```bibtex
@article{jeffries2024moonshine,
  title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author={Jeffries, Nat and King, Evan and Kudlur, Manjunath and Nicholson, Guy and Wang, James and Warden, Pete},
  journal={arXiv preprint arXiv:2410.15608},
  year={2024}
}
```