---
library_name: transformers
tags:
- automatic-speech-recognition
- speech
- audio
- transformers
- pytorch
- safetensors
- vllm
- ark-asr
pipeline_tag: automatic-speech-recognition
language:
- zh
- en
- de
- ja
- fr
- ko
- es
- pl
- it
- ro
- hu
- cs
- nl
- fi
- hr
- sk
- sl
- et
- lt
license: apache-2.0
repository: https://github.com/AutoArk/open-audio-opd
---

# ARK-ASR-3B: State-of-the-Art Multilingual ASR

[![GitHub](https://img.shields.io/badge/GitHub-AutoArk%2Fopen--audio--opd-blue?logo=github)](https://github.com/AutoArk/open-audio-opd) [![arXiv](https://img.shields.io/badge/arXiv-2605.28139-b31b1b?logo=arxiv)](https://arxiv.org/abs/2605.28139) [![License](https://img.shields.io/badge/License-Apache--2.0-green)](https://www.apache.org/licenses/LICENSE-2.0)

> **TL;DR** ARK-ASR-3B is a multilingual automatic speech recognition model. It achieves current state-of-the-art results on the Hugging Face Open ASR Leaderboard English short-form benchmark, with an average WER of **5.13%** across AMI, Earnings22, GigaSpeech, LibriSpeech (clean and other), SPGISpeech, and VoxPopuli. The accompanying training, inference, and evaluation code is available at [AutoArk/open-audio-opd](https://github.com/AutoArk/open-audio-opd).

## Abstract

ARK-ASR-3B is a 3B-scale audio-capable autoregressive Transformers model for automatic speech recognition. It combines a Whisper-style audio encoder, an MLP adapter, and a Qwen decoder with custom `arkasr` remote code. ARK-ASR currently supports ASR in 19 languages.

## Supported Languages

Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian.

## Model Overview


Figure 1: ARK-ASR architecture. Audio is encoded by a Whisper-style encoder with RoPE, merged through an MLP adapter, and injected into a Qwen decoder by replacing audio placeholder token embeddings before transcript generation.
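
The placeholder-replacement step in the caption can be illustrated with a small sketch. This is not the `arkasr` remote code: plain Python lists stand in for tensors, and the function name, the placeholder id `99`, and the two-dimensional embeddings are all hypothetical choices made for illustration. The only idea taken from the caption is that adapter outputs overwrite the embeddings at audio placeholder positions before decoding.

```python
def inject_audio_embeddings(token_ids, text_embeddings, audio_embeddings, audio_token_id):
    """Overwrite embeddings at audio placeholder positions with adapter outputs.

    token_ids:        prompt token ids (list of int)
    text_embeddings:  one embedding vector per prompt token
    audio_embeddings: one adapter output vector per audio placeholder
    audio_token_id:   id of the audio placeholder token (hypothetical value)
    """
    positions = [i for i, tok in enumerate(token_ids) if tok == audio_token_id]
    if len(positions) != len(audio_embeddings):
        raise ValueError(
            f"{len(positions)} placeholders but {len(audio_embeddings)} audio vectors"
        )
    # Copy so the caller's text embeddings are left untouched.
    merged = [list(vec) for vec in text_embeddings]
    for pos, audio_vec in zip(positions, audio_embeddings):
        merged[pos] = list(audio_vec)
    return merged


# Toy prompt: two text tokens surrounding two audio placeholders (id 99).
token_ids = [5, 99, 99, 7]
text_embeddings = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.7, 0.7]]
audio_embeddings = [[1.0, 2.0], [3.0, 4.0]]
merged = inject_audio_embeddings(token_ids, text_embeddings, audio_embeddings, 99)
print(merged)
```

The length check reflects one design assumption: the prompt must contain exactly as many placeholders as the adapter emits vectors, otherwise the sequence and the audio features would be misaligned.
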

- **Model size:** 3B-scale decoder LLM with a dedicated Whisper-style audio encoder and MLP adapter
- **Task:** automatic speech recognition
- **Architecture:** audio-capable autoregressive Transformers model with custom `arkasr` remote code
- **Checkpoint format:** `safetensors`
- **Sampling rate:** 16 kHz
- **Recommended inference code:** [`scripts/infer/ark_asr_transformers.py`](https://github.com/AutoArk/open-audio-opd/blob/main/scripts/infer/ark_asr_transformers.py)
- **vLLM serving:** [`scripts/vllm/ark_asr_vllm`](https://github.com/AutoArk/open-audio-opd/tree/master/scripts/vllm/ark_asr_vllm)

The model should be loaded with `trust_remote_code=True`. The official inference script handles the processor, tokenizer, audio prompt format, generation cleanup, and ASR token filtering.

## Performance

The following results are from the Hugging Face [Open ASR Leaderboard](https://huggingface.co/datasets/hf-audio/open-asr-leaderboard). Lower WER is better. ARK-ASR-3B reaches the current state of the art on this English short-form benchmark.

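
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal standalone illustration, not the leaderboard's scoring code: the function name is hypothetical, and it omits the text normalization that evaluation pipelines typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.
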
### English WER

| Model | AMI | Earnings22 | GigaSpeech | LS Clean | LS Other | SPGISpeech | VoxPopuli | Avg |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| ARK-ASR-3B | **8.91%** | **8.25%** | **7.30%** | **1.09%** | **2.41%** | **2.49%** | **5.48%** | **5.13%** |
| ARK-ASR-0.6B | 10.02% | 9.77% | 8.00% | 1.53% | 3.51% | 2.63% | 6.31% | 5.97% |

LS Clean and LS Other are the LibriSpeech clean and other test sets. Avg is the unweighted mean of the seven test-set columns.

## Inference

Run ASR inference with Hugging Face Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_path = "AutoArk-AI/ARK-ASR-3B"
audio_path = "assets/libai.wav"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Use bfloat16 on GPU and fall back to float32 on CPU.
torch_dtype = torch.bfloat16 if device == "cuda" else torch.float32

# The custom `arkasr` code requires trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    attn_implementation="sdpa",
).to(device)
model.eval()


def build_bad_words_ids(tokenizer):
    """Ban every special or `<...>` control token except EOS during generation."""
    eos_ids = tokenizer.eos_token_id
    keep_ids = {eos_ids} if isinstance(eos_ids, int) else set(eos_ids or [])
    bad_ids = set(tokenizer.all_special_ids) - keep_ids
    bad_ids.update(
        token_id
        for token, token_id in tokenizer.get_added_vocab().items()
        if token.startswith("<") and token.endswith(">") and token_id not in keep_ids
    )
    return [[token_id] for token_id in sorted(bad_ids)]


conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": audio_path},
            {"type": "text", "text": "Please transcribe this audio."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    sampling_rate=16000,
    audio_padding="longest",
    text_kwargs={"padding": "longest"},
    audio_max_length=30 * 16000,  # 480,000 samples = 30 s at 16 kHz
)
inputs = inputs.to(device)
# Match the audio features to the model dtype.
if "audios" in inputs:
    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)

bad_words_ids = build_bad_words_ids(tokenizer)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=256,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        bad_words_ids=bad_words_ids,
    )

# Drop the prompt tokens and decode only the newly generated transcript.
decoded_outputs = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1] :],
    skip_special_tokens=True,
)
print(decoded_outputs)
```

For batch JSONL inference, use the open-source inference code:

```bash
git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e .
```

The input JSONL should contain one ASR sample per line:

```json
{"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
```

```bash
python scripts/infer/ark_asr_transformers.py \
  --input /path/to/input.jsonl \
  --output runs/infer/predictions.jsonl \
  --model_path AutoArk-AI/ARK-ASR-3B \
  --processor_path AutoArk-AI/ARK-ASR-3B \
  --batch_size 40 \
  --dtype bfloat16 \
  --attn_impl sdpa
```

The output JSONL preserves input metadata and adds:

- `pred_text`: cleaned prediction text for downstream evaluation
- `pred_text_raw`: raw decoded generation before cleanup

## vLLM Online Serving

ARK-ASR can also be deployed as a vLLM-backed online ASR service with the adapter in [`scripts/vllm/ark_asr_vllm`](https://github.com/AutoArk/open-audio-opd/tree/master/scripts/vllm/ark_asr_vllm). The service exposes both a compact `/asr` endpoint and an OpenAI-style `/v1/audio/transcriptions` endpoint.

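
Once the service is running, the `/asr` endpoint can also be called from Python. The sketch below uses only the standard library and sends the same multipart form fields as the `curl` example for this endpoint (`file` and `max_new_tokens`). The helper names are hypothetical, the host and port are taken from the default example, and the response is returned as raw text because its schema is not described here.

```python
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields and one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(audio_path, url="http://127.0.0.1:8025/asr", max_new_tokens=256):
    """POST one audio file to the service and return the raw response text."""
    with open(audio_path, "rb") as handle:
        audio_bytes = handle.read()
    body, content_type = build_multipart(
        {"max_new_tokens": max_new_tokens},
        "file",
        os.path.basename(audio_path),
        audio_bytes,
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `transcribe("/path/to/audio.wav")` requires a running service. For the OpenAI-style endpoint, the same builder can be reused with the `model` form field in place of `max_new_tokens`.
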
Clone and install the serving code:

```bash
git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e ".[vllm]"
```

Start the service:

```bash
MODEL=AutoArk-AI/ARK-ASR-3B \
GPU=0 \
PORT=8025 \
scripts/vllm/deploy_ark_asr_vllm_service.sh start
```

Check the service:

```bash
scripts/vllm/deploy_ark_asr_vllm_service.sh status
curl -sS http://127.0.0.1:8025/health
curl -sS http://127.0.0.1:8025/token-mask
```

Run one transcription request:

```bash
curl -sS -X POST http://127.0.0.1:8025/asr \
  -F file=@/path/to/audio.wav \
  -F max_new_tokens=256
```

OpenAI-style transcription endpoint:

```bash
curl -sS -X POST http://127.0.0.1:8025/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=ark-asr
```

Stop the service:

```bash
scripts/vllm/deploy_ark_asr_vllm_service.sh stop
```

The vLLM adapter registers the custom `arkasr` model, loads the local processor/tokenizer with `trust_remote_code=True`, applies generation-time token masking for non-ASR control tokens, and keeps `<|im_end|>` as the stop token. Service logs and PID files are written under `runs/vllm/`.

## Evaluation

The reported leaderboard numbers are evaluated with the Hugging Face [`open_asr_leaderboard`](https://github.com/huggingface/open_asr_leaderboard) evaluation code.

For local J/WER evaluation, the repository also includes this entrypoint:

```bash
python scripts/eval/eval_jwer_ark_asr_transformers.py \
  --input /path/to/test.jsonl \
  --output runs/eval/result.jsonl \
  --model_path AutoArk-AI/ARK-ASR-3B \
  --processor_path AutoArk-AI/ARK-ASR-3B \
  --batch_size 40 \
  --dtype bfloat16 \
  --attn_impl sdpa
```

No evaluation audio or dataset files are bundled with this model repository.

## Acknowledgements

The training code is based on [THUNLP/OPD](https://github.com/thunlp/OPD/) and [verl](https://github.com/volcengine/verl). The OPD recipe uses a stronger ASR teacher to score online student rollouts.

## Citation

If you find ARK-ASR or open-audio-opd useful, please cite:

```bibtex
@misc{lin2026dataefficientopd,
  title={Data-Efficient On-Policy Distillation for Automatic Speech Recognition},
  author={Lin, Yu and Wang, Yiming and Cai, Runyuan and Zeng, Xiaodong},
  year={2026},
  eprint={2605.28139},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.28139}
}
```