Omnilingual ASR — CTC 1B (MLX 4-bit)

MLX-compatible 4-bit quantization of Meta's Omnilingual ASR CTC-1B model for on-device inference on Apple Silicon (M1/M2/M3/M4). The 1B variant takes about 360 MB more disk space than the 300M build and, per Meta's published CER/WER curves on FLEURS, is noticeably more accurate on low-resource languages.

Omnilingual ASR is a wav2vec 2.0-style encoder-only model with a linear CTC head, trained by Meta for speech recognition across 1,600+ languages. The CTC variant is language-agnostic at inference time: it takes only audio as input, with no language ID or prompt.

Model

Parameters 1.01 B
Format MLX safetensors (quantized linear layers + fp16 features)
Quantization 4-bit per-group min-max, group size 64
Encoder layers 48
Encoder dim 1280
Attention heads 20
FFN dim 5120
Sample rate 16 kHz (raw waveform input)
Frame rate 50 fps
Max duration 40 s
Languages 1,600+
Vocabulary 10,288 SentencePiece tokens
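
To make the "4-bit per-group min-max, group size 64" entry concrete, here is a minimal NumPy sketch of that style of quantization. It is an illustration under stated assumptions, not the export script used for this checkpoint: the function names `quantize_4bit` and `dequantize_4bit` are hypothetical, and the sketch keeps one code per byte, whereas the real file stores the codes packed.

```python
import numpy as np

def quantize_4bit(w, group_size=64):
    """Per-group min-max (affine) quantization of a 2-D matrix to 4-bit codes."""
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    g = w.reshape(rows, cols // group_size, group_size).astype(np.float32)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4 bits give 16 levels: 0..15
    scale = np.where(scale == 0, 1.0, scale)  # constant group: avoid divide-by-zero
    q = np.clip(np.round((g - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize_4bit(q, scale, w_min, shape):
    """Map 4-bit codes back to floats: w ≈ q * scale + min, per group."""
    return (q.astype(np.float32) * scale + w_min).reshape(shape)

# Toy weight matrix (illustrative data, not model weights).
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 128)).astype(np.float32)
q, scale, w_min = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, w_min, w.shape)
max_err = float(np.abs(w - w_hat).max())
```

Each group of 64 weights shares one scale and one minimum, so the rounding error for any weight is bounded by half that group's scale.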

Files

File               Size    Description
model.safetensors  549 MB  4-bit quantized transformer weights + fp16 conv frontend
tokenizer.model    1.2 MB  SentencePiece tokenizer (unk=3, pad=1, eos=2, bos=0)
config.json        <1 KB   Architecture + quantization metadata

Architecture

Raw audio [1, samples]
  → Wav2Vec2FeatureExtractor (7-layer 1D conv, stride 320×)
  → Linear 512 → 1280
  → Wav2Vec2PositionEncoder (weight-normalized conv, kernel 128, groups 16)
  → 48 × StandardTransformerEncoderLayer (pre-norm, dim 1280, heads 20, ffn 5120)
  → LayerNorm
  → Linear 1280 → 10288   (CTC head)
  → logits [1, T, 10288]   (T ≈ samples/320 frames)
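
The 320× stride in the diagram above determines how many logit frames a clip produces. The sketch below assumes the standard wav2vec 2.0 convolution stack (kernel and stride per layer as listed in `CONV_LAYERS`); the card itself only states "7-layer 1D conv, stride 320×", so treat the per-layer values as an assumption. The helper names are hypothetical.

```python
# Assumed: the standard wav2vec 2.0 feature-extractor layout, (kernel, stride) per layer.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def total_stride(layers=CONV_LAYERS):
    """Product of per-layer strides, i.e. input samples per output frame."""
    s = 1
    for _, stride in layers:
        s *= stride
    return s

def num_frames(num_samples, layers=CONV_LAYERS):
    """Output length of the unpadded conv stack for a given number of samples."""
    n = num_samples
    for kernel, stride in layers:
        n = (n - kernel) // stride + 1
    return n

frames_1s = num_frames(16_000)        # one second at 16 kHz
frames_40s = num_frames(40 * 16_000)  # the 40 s maximum duration
```

Under these assumptions, one second of audio yields 49 frames and a 40 s clip yields 1999, slightly below the nominal 50 fps because the convolutions are unpadded.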

Decoding is CTC greedy: take the argmax token for each frame, collapse consecutive duplicates, and drop blank tokens.
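
The decoding step can be sketched in a few lines of NumPy. This is a generic CTC greedy decoder, not code taken from this repository; the function name is hypothetical, and `blank_id` is left as a parameter because the card does not state which token ID serves as the CTC blank.

```python
import numpy as np

def ctc_greedy_decode(logits, blank_id):
    """logits: [T, vocab] array. Returns token IDs after CTC collapsing."""
    path = np.argmax(logits, axis=-1).tolist()  # best token per frame
    out = []
    prev = None
    for tok in path:
        # Keep a token only when it differs from the previous frame and is not blank.
        if tok != prev and tok != blank_id:
            out.append(tok)
        prev = tok
    return out

# Toy example: one-hot "logits" for the frame path 0 5 5 0 5 7 7 0, with blank = 0.
toy_path = [0, 5, 5, 0, 5, 7, 7, 0]
toy_logits = np.eye(10, dtype=np.float32)[toy_path]
decoded = ctc_greedy_decode(toy_logits, blank_id=0)  # -> [5, 5, 7]
```

The blank between the two runs of `5` is what lets a repeated token survive collapsing. The resulting IDs can then be turned into text with the SentencePiece tokenizer, for example `SentencePieceProcessor(model_file="tokenizer.model").decode(ids)`.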

Performance

Meta's upstream omniASR-CTC-1B model card reports substantial improvements over CTC-300M on low-resource languages. This MLX 4-bit export keeps the same architecture; based on how wav2vec2-class models typically behave under 4-bit quantization, it should land within ~1% absolute WER of the fp32 model.

For a direct comparison with CTC-300M on FLEURS, see the 300M 4-bit card.

Usage

import mlx.core as mx
from safetensors import safe_open

weights = {}
with safe_open("model.safetensors", framework="mlx") as f:
    for k in f.keys():
        weights[k] = f.get_tensor(k)

# Your MLX wav2vec2 + CTC implementation consumes these keys.
# Input : float32 audio [1, samples] at 16 kHz, zero-mean unit-var
# Output: logits [1, T, 10288] then CTC greedy decode via tokenizer.model
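
The input comment above asks for zero-mean, unit-variance audio. A minimal sketch of that preprocessing step follows, assuming per-utterance normalization over a mono waveform already sampled at 16 kHz; the function name and the `eps` value are illustrative choices, not taken from the upstream code.

```python
import numpy as np

def normalize_waveform(audio, eps=1e-5):
    """Return float32 audio shaped [1, samples] with zero mean and unit variance."""
    audio = np.asarray(audio, dtype=np.float32)
    mean = audio.mean()
    var = audio.var()
    normed = (audio - mean) / np.sqrt(var + eps)  # eps guards against silent clips
    return normed[None, :]                        # add the batch dimension

# Toy input: one second of a 440 Hz tone with a DC offset (illustrative only).
t = np.arange(16_000, dtype=np.float32) / 16_000
tone = 0.1 * np.sin(2 * np.pi * 440 * t) + 0.05
batch = normalize_waveform(tone)
```

The result can be converted with `mx.array(batch)` before being passed to an MLX model. Resampling to 16 kHz is not covered here and would need to happen beforehand.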

Swift inference is provided by speech-swift.

Source

Links

License

Apache 2.0 (inherited from upstream).

