ARK-ASR-3B: State-of-the-Art Multilingual ASR

TL;DR: ARK-ASR-3B is a multilingual automatic speech recognition model. It achieves current state-of-the-art results on the Hugging Face Open ASR Leaderboard English short-form benchmark, with an average WER of 5.13% across seven test sets: AMI, Earnings22, GigaSpeech, LibriSpeech (clean and other), SPGISpeech, and VoxPopuli. The accompanying training, inference, and evaluation code is available at AutoArk/open-audio-opd.

Abstract

ARK-ASR-3B is a 3B-scale audio-capable autoregressive Transformers model for automatic speech recognition.

It combines a Whisper-style audio encoder, an MLP adapter, and a Qwen decoder with custom arkasr remote code.

ARK-ASR currently supports ASR in 19 languages (see Supported Languages).

Supported Languages

Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian.

Model Overview

Figure 1: ARK-ASR architecture. Audio is encoded by a Whisper-style encoder with RoPE, merged through an MLP adapter, and injected into a Qwen decoder by replacing audio placeholder token embeddings before transcript generation.
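
The placeholder-replacement step in Figure 1 can be illustrated with a minimal, dependency-free sketch. It uses plain Python lists in place of tensors, and the placeholder id (9999), the function name, and the toy embeddings are all assumptions made for illustration; the actual arkasr remote code is not reproduced here and may differ.

```python
# Illustrative only: plain lists stand in for tensors, and every name and id
# below is hypothetical rather than taken from the arkasr remote code.
AUDIO_PLACEHOLDER_ID = 9999  # assumed id of the audio placeholder token


def inject_audio_embeddings(input_ids, text_embeddings, audio_embeddings):
    """Replace placeholder-token embeddings with adapter outputs, in order."""
    n_slots = sum(1 for tok in input_ids if tok == AUDIO_PLACEHOLDER_ID)
    if n_slots != len(audio_embeddings):
        raise ValueError(
            f"{n_slots} placeholder tokens but {len(audio_embeddings)} audio frames"
        )
    audio_iter = iter(audio_embeddings)
    merged = []
    for tok, emb in zip(input_ids, text_embeddings):
        # Placeholder positions take the next audio frame; all other
        # positions keep their ordinary text embedding.
        merged.append(next(audio_iter) if tok == AUDIO_PLACEHOLDER_ID else emb)
    return merged


input_ids = [1, 9999, 9999, 2]
text_embeddings = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.2, 0.2]]
audio_embeddings = [[0.5, 0.6], [0.7, 0.8]]  # stand-in for MLP adapter output
merged = inject_audio_embeddings(input_ids, text_embeddings, audio_embeddings)
print(merged)
```

The decoder then consumes the merged sequence like any other embedded prompt, which is why the number of placeholder tokens has to match the number of audio frames produced by the adapter.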

Model size: 3B-scale decoder LLM with a dedicated Whisper-style audio encoder and MLP adapter
Task: automatic speech recognition
Architecture: audio-capable autoregressive Transformers model with custom arkasr remote code
Checkpoint format: safetensors
Sampling rate: 16 kHz
Recommended inference code: scripts/infer/ark_asr_transformers.py
vLLM serving: scripts/vllm/ark_asr_vllm

The model should be loaded with trust_remote_code=True. The official inference script handles the processor, tokenizer, audio prompt format, generation cleanup, and ASR token filtering.
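
Because the model expects 16 kHz audio, it can help to inspect input files before inference. The sketch below uses only the standard-library wave module. The 30-second limit mirrors the audio_max_length=30 * 16000 argument in the Transformers example and is an assumption about the intended window; whether the processor resamples or truncates automatically is not stated in this card. The helper name inspect_wav is hypothetical.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000  # sampling rate listed in the model overview
MAX_SECONDS = 30       # assumed window, mirroring audio_max_length=30 * 16000


def inspect_wav(path):
    """Return the sample rate and duration of a WAV file with simple checks."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    return {
        "rate": rate,
        "seconds": seconds,
        "rate_ok": rate == EXPECTED_RATE,
        "fits_window": seconds <= MAX_SECONDS,
    }


# Demo input: one second of 16-bit mono silence written to a temp directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(EXPECTED_RATE)
    wav.writeframes(b"\x00\x00" * EXPECTED_RATE)

info = inspect_wav(demo_path)
print(info)
```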

Performance

The following results are from the Hugging Face Open ASR Leaderboard. Lower WER is better. ARK-ASR-3B reaches the current state of the art on this English short-form benchmark.

English WER

Model	AMI	Earnings22	GigaSpeech	LS Clean	LS Other	SPGISpeech	VoxPopuli	Avg
ARK-ASR-3B	8.91%	8.25%	7.30%	1.09%	2.41%	2.49%	5.48%	5.13%
ARK-ASR-0.6B	10.02%	9.77%	8.00%	1.53%	3.51%	2.63%	6.31%	5.97%
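
The Avg column appears to be the unweighted mean of the seven per-test-set WERs. A quick check over the values in the table above (no other data is assumed):

```python
# Per-test-set WER (%) copied from the table: AMI, Earnings22, GigaSpeech,
# LS Clean, LS Other, SPGISpeech, VoxPopuli.
scores = {
    "ARK-ASR-3B": [8.91, 8.25, 7.30, 1.09, 2.41, 2.49, 5.48],
    "ARK-ASR-0.6B": [10.02, 9.77, 8.00, 1.53, 3.51, 2.63, 6.31],
}

# Unweighted (macro) average across test sets, rounded to two decimals.
averages = {name: round(sum(v) / len(v), 2) for name, v in scores.items()}
print(averages)  # {'ARK-ASR-3B': 5.13, 'ARK-ASR-0.6B': 5.97}
```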

Inference

Run ASR inference with Hugging Face Transformers:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_path = "AutoArk-AI/ARK-ASR-3B"
audio_path = "assets/libai.wav"

device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    attn_implementation="sdpa",
).to(device)
model.eval()


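def build_bad_words_ids(tokenizer):
    """Ban every special or control token except EOS during generation."""
    eos_ids = tokenizer.eos_token_id
    # eos_token_id may be an int, a list, or None; normalize it to a set.
    keep_ids = {eos_ids} if isinstance(eos_ids, int) else set(eos_ids or [])
    bad_ids = set(tokenizer.all_special_ids) - keep_ids
    # Also ban added-vocabulary tokens that look like control tags ("<...>").
    bad_ids.update(
        token_id
        for token, token_id in tokenizer.get_added_vocab().items()
        if token.startswith("<") and token.endswith(">") and token_id not in keep_ids
    )
    # generate() expects a list of token-id sequences, one per banned entry.
    return [[token_id] for token_id in sorted(bad_ids)]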

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": audio_path},
            {"type": "text", "text": "Please transcribe this audio."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    sampling_rate=16000,
    audio_padding="longest",
    text_kwargs={"padding": "longest"},
    audio_max_length=30 * 16000,
)
inputs = inputs.to(device)
if "audios" in inputs:
    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)

bad_words_ids = build_bad_words_ids(tokenizer)
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=256,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        bad_words_ids=bad_words_ids,
    )
# Slice off the prompt tokens so only the newly generated transcript is decoded.
decoded_outputs = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1] :],
    skip_special_tokens=True,
)
print(decoded_outputs)

For batch JSONL inference, use the open-source inference code:

git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e .

The input JSONL should contain one ASR sample per line:

{"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
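
A manifest in this format can be generated with a few lines of standard-library Python. This is a minimal sketch: the field values are copied from the example record above (their exact semantics are not documented in this card), and the helper name build_manifest and the audio paths are placeholders.

```python
import json
import os
import tempfile


def build_manifest(audio_paths, manifest_path):
    """Write one ASR sample per line, mirroring the example record."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for path in audio_paths:
            record = {
                "audio": path,
                "text": "",        # left empty, as in the example record
                "task": "asr",
                "begin_time": -1,  # copied from the example record
                "end_time": -1,    # copied from the example record
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(audio_paths)


# Placeholder paths; replace them with real audio files.
manifest_path = os.path.join(tempfile.mkdtemp(), "input.jsonl")
count = build_manifest(["/path/to/a.wav", "/path/to/b.wav"], manifest_path)
print(count, manifest_path)
```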

Run batch inference on that file:

python scripts/infer/ark_asr_transformers.py \
  --input /path/to/input.jsonl \
  --output runs/infer/predictions.jsonl \
  --model_path AutoArk-AI/ARK-ASR-3B \
  --processor_path AutoArk-AI/ARK-ASR-3B \
  --batch_size 40 \
  --dtype bfloat16 \
  --attn_impl sdpa

The output JSONL preserves input metadata and adds:

pred_text: cleaned prediction text for downstream evaluation
pred_text_raw: raw decoded generation before cleanup
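
The predictions file can then be read back line by line. The sketch below assumes only the two fields named above plus the audio field carried over from the input. It writes a small demo file so that it runs on its own; the demo transcripts and the special token shown in them are invented for illustration.

```python
import json
import os
import tempfile


def load_predictions(path):
    """Parse a predictions JSONL file into a list of dicts, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


# Demo file with invented rows, standing in for runs/infer/predictions.jsonl.
demo_rows = [
    {"audio": "/path/to/a.wav", "pred_text": "hello world", "pred_text_raw": "hello world<|im_end|>"},
    {"audio": "/path/to/b.wav", "pred_text": "good morning", "pred_text_raw": "good morning"},
]
demo_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    for row in demo_rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

rows = load_predictions(demo_path)
pairs = [(row["audio"], row["pred_text"]) for row in rows]
print(pairs)
```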

vLLM Online Serving

ARK-ASR can also be deployed as a vLLM-backed online ASR service with the adapter in scripts/vllm/ark_asr_vllm. The service exposes both a compact /asr endpoint and an OpenAI-style /v1/audio/transcriptions endpoint.

Clone and install the serving code:

git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e ".[vllm]"

Start the service:

MODEL=AutoArk-AI/ARK-ASR-3B \
GPU=0 \
PORT=8025 \
scripts/vllm/deploy_ark_asr_vllm_service.sh start

Check the service:

scripts/vllm/deploy_ark_asr_vllm_service.sh status
curl -sS http://127.0.0.1:8025/health
curl -sS http://127.0.0.1:8025/token-mask

Run one transcription request:

curl -sS -X POST http://127.0.0.1:8025/asr \
  -F file=@/path/to/audio.wav \
  -F max_new_tokens=256

OpenAI-style transcription endpoint:

curl -sS -X POST http://127.0.0.1:8025/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=ark-asr
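
The same /asr request can be built from Python with only the standard library. This is a sketch under assumptions: the form field names (file, max_new_tokens), host, and port are taken from the curl example above, while the helper names are hypothetical and the response format is not documented here, so the code builds the request without parsing any reply.

```python
import io
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\n".encode())
        buf.write(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        buf.write(f"{value}\r\n".encode())
    buf.write(f"--{boundary}\r\n".encode())
    buf.write(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    buf.write(b"Content-Type: application/octet-stream\r\n\r\n")
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def build_asr_request(audio_bytes, filename="audio.wav",
                      base_url="http://127.0.0.1:8025", max_new_tokens=256):
    """Build (but do not send) a POST request for the /asr endpoint."""
    body, content_type = build_multipart(
        {"max_new_tokens": max_new_tokens}, "file", filename, audio_bytes
    )
    return urllib.request.Request(
        f"{base_url}/asr",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Placeholder bytes; in practice read them from a real WAV file.
request = build_asr_request(b"RIFF-placeholder-bytes")
print(request.get_method(), request.full_url)
```

With the service running, the request can be sent with urllib.request.urlopen(request, timeout=60) and the raw response body inspected before deciding how to parse it.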

Stop the service:

scripts/vllm/deploy_ark_asr_vllm_service.sh stop

The vLLM adapter registers the custom arkasr model, loads the local processor/tokenizer with trust_remote_code=True, applies generation-time token masking for non-ASR control tokens, and keeps <|im_end|> as the stop token. Service logs and PID files are written under runs/vllm/.
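
Generation-time token masking is commonly implemented by pushing the scores of banned tokens to negative infinity before the next token is chosen. The following is a generic, dependency-free sketch of that idea on a toy vocabulary; it is not the adapter's actual code, and the function names and values are hypothetical.

```python
import math


def mask_logits(logits, banned_ids):
    """Return a copy of logits with banned token scores set to -inf."""
    banned = set(banned_ids)
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]


def greedy_pick(logits, banned_ids):
    """Pick the highest-scoring token id that is not banned."""
    masked = mask_logits(logits, banned_ids)
    return max(range(len(masked)), key=masked.__getitem__)


logits = [0.2, 3.1, 2.5, 0.7]  # toy vocabulary of four tokens
banned_ids = [1]               # pretend token 1 is a non-ASR control token
choice = greedy_pick(logits, banned_ids)
print(choice)  # 2: token 1 scores highest but is masked out
```

The stop token must stay out of the banned set, otherwise generation could never terminate normally; this matches the card's note that <|im_end|> is kept as the stop token.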

Evaluation

The reported leaderboard numbers were computed with the Hugging Face open_asr_leaderboard evaluation code.

For local J/WER evaluation, the repository also includes this entrypoint:

python scripts/eval/eval_jwer_ark_asr_transformers.py \
  --input /path/to/test.jsonl \
  --output runs/eval/result.jsonl \
  --model_path AutoArk-AI/ARK-ASR-3B \
  --processor_path AutoArk-AI/ARK-ASR-3B \
  --batch_size 40 \
  --dtype bfloat16 \
  --attn_impl sdpa

No evaluation audio or dataset files are bundled with this model repository.
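
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a plain implementation of that definition for a single utterance. It applies no text normalization and is not the repository's evaluation script, so its numbers are not directly comparable to leaderboard results.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.4f}")  # 0.3333
```

Corpus-level WER is usually computed by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance rates.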

Acknowledgements

The training code is based on THUNLP/OPD and verl. The on-policy distillation (OPD) recipe uses a stronger ASR teacher to score online student rollouts.

Citation

If you find ARK-ASR or open-audio-opd useful, please cite:

@misc{lin2026dataefficientopd,
  title={Data-Efficient On-Policy Distillation for Automatic Speech Recognition},
  author={Lin, Yu and Wang, Yiming and Cai, Runyuan and Zeng, Xiaodong},
  year={2026},
  eprint={2605.28139},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.28139}
}

Checkpoint: safetensors, 4B params, BF16 tensors
