WorldSeek-Omni-2B-Preview



This repository contains the model weights, configuration files, and inference scripts for WorldSeek-Omni-2B-Preview.

This is a preview release intended for research, evaluation, and development integration. Complete benchmark results, dataset documentation, and model-card details for the final release will be added in a later update.

WorldSeek-Omni-2B-Preview is an early checkpoint of WorldSeek-Omni-2B. The full training process has not yet been completed, so capabilities and benchmark results should be interpreted as preview-stage behavior rather than final model performance.

WorldSeek-Omni-2B-Preview is an omni model built on Qwen3.5-2B and Qwen3-ASR-1.7B. It preserves the text and vision-language capabilities of Qwen3.5-2B while integrating the AuT audio encoder from Qwen3-ASR-1.7B for speech input and speech recognition tasks.

Highlights

  • Omni input support: Supports text, image, and audio inputs. Video input can follow the Qwen3.5 multimodal message format when supported by the serving backend.
  • Qwen3.5-2B backbone: Text, image, and OpenAI-compatible multimodal usage follow the Qwen3.5 model family.
  • Qwen3-ASR-1.7B audio tower: The audio encoder is adapted from Qwen3-ASR-1.7B and connected to the Qwen3.5-2B language model.
  • Chinese and English ASR: The preview release currently focuses on Chinese and English speech recognition. Quality is still being improved, and multilingual capabilities will be expanded in future releases.
  • Route-based inference: The omni route serves the base omni model, and the asr route serves the same model with the ASR LoRA enabled.
  • Multiple inference backends: The package includes vLLM OpenAI-compatible serving and standalone Hugging Face local CLI / FastAPI serving scripts.

Architecture Highlights

WorldSeek-Omni-2B-Preview combines the AuT audio encoder from Qwen3-ASR-1.7B (Qwen3-AuT-300M) with the language and vision backbone of Qwen3.5-2B. Audio inputs are first encoded into speech representations by AuT, then adapted into the Qwen3.5-2B language model through the WorldSeek-Omni bridge. This design preserves the text and vision-language behavior of Qwen3.5 while enabling speech input and ASR decoding.

[Figure: WorldSeek-Omni architecture]

Model Overview

| Feature | Value |
| --- | --- |
| Type | Omni / Multimodal Causal Language Model |
| Parameters | 2B |
| Language and Vision Backbone | Qwen3.5-2B |
| Audio Encoder | Qwen3-AuT-300M |
| Maximum Sequence Length | 262144 |
| Maximum Audio Input | 60s |
| Supported Input Types | Text, image, video, audio |
| Current ASR Languages | Chinese, English |
| Inference Routes | omni, asr |
| Supported Inference Backends | vLLM (v0.21.0) / Hugging Face Transformers (>=4.57.0) |
| Release Status | Preview / early checkpoint |
| License | Apache 2.0 |
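
The 60s maximum audio input can be checked on the client before a request is sent. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical and not part of this repository, it handles PCM WAV files only, and it assumes the limit is inclusive, which the overview does not state.

```python
import wave

# "Maximum Audio Input" value from the Model Overview (assumed inclusive).
MAX_AUDIO_SECONDS = 60.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_audio_limit(path: str, limit: float = MAX_AUDIO_SECONDS) -> bool:
    """Return True if the clip is no longer than the given limit."""
    return wav_duration_seconds(path) <= limit
```

A caller could run fits_audio_limit on each file and split or reject longer clips before using the asr route.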

Benchmark Results

The following figures summarize the current preview-stage benchmark results. ASR values report CER / WER where applicable; text, vision, and video benchmarks report accuracy-style scores where higher is better. Since this release is an early checkpoint, these results should not be interpreted as final model performance.
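
For readers unfamiliar with the ASR metrics, CER and WER are edit-distance error rates computed over characters and words respectively, where lower is better. The sketch below shows the standard definition; it is not the evaluation code used for these benchmarks, and it applies no text normalization (casing, punctuation, number formatting), which real evaluations usually do.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

CER is the usual choice for Chinese, where word boundaries are not marked by spaces, and WER is the usual choice for English.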

ASR Benchmark

[Figure: ASR benchmark comparison]

Text, Vision and Video Benchmark

[Figure: Text, vision, and video benchmark comparison]

Installation

We recommend using an isolated Python environment. For vLLM serving, install vLLM 0.21.0 or a compatible build, and follow the official vLLM installation guide for a CUDA / PyTorch build that matches your system.

# vLLM serving
pip install "vllm>=0.21.0"

For the vLLM request example script, also install:

pip install requests

For Hugging Face local inference, install the following basic dependencies:

pip install torch "transformers>=4.57.0" safetensors soundfile pillow

For the Hugging Face FastAPI resident server, also install:

pip install fastapi uvicorn python-multipart

Quickstart

WorldSeek-Omni-2B-Preview uses two model names to distinguish request routes:

omni -> base omni route
asr  -> base model + ASR LoRA route

Speech recognition requests should explicitly use model="asr" or --route asr.
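
A client that handles mixed traffic can pick the route from the request content. This is a minimal sketch, assuming that any message carrying an audio_url content part (the part type used in the Audio Chat example) should go to asr and everything else to omni; pick_route is a hypothetical helper, not part of this repository.

```python
from typing import Any, Dict, List


def pick_route(messages: List[Dict[str, Any]]) -> str:
    """Return "asr" if any message contains an audio part, else "omni"."""
    for message in messages:
        content = message.get("content")
        # Plain-string content is text only; multimodal content is a list of parts.
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "audio_url":
                    return "asr"
    return "omni"
```

The returned string can be passed directly as the model argument of a Chat Completions request.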

vLLM Serving

The following command starts an OpenAI-compatible API server at http://localhost:8000/v1:

MODEL_PATH="/path/to/WorldSeek-Omni-2B-Preview"

vllm serve "${MODEL_PATH}" \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.3 \
  --served-model-name "omni" \
  --dtype "bfloat16" \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --limit-mm-per-prompt '{"image":1,"video":1,"audio":1}' \
  --trust-remote-code \
  --enable-lora \
  --lora-modules asr="${MODEL_PATH}" \
  --max-lora-rank 256

Check available models:

curl http://localhost:8000/v1/models

Expected model names:

omni
asr
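
The same check can be scripted, for example as a readiness probe before sending requests. The sketch below uses only the standard library and assumes the server returns the usual OpenAI-style model list, a JSON object whose data array holds entries with an id field; the function names are hypothetical.

```python
import json
import urllib.request
from typing import Set

EXPECTED_ROUTES = {"omni", "asr"}


def served_model_ids(models_payload: dict) -> Set[str]:
    """Extract model names from an OpenAI-style /v1/models response body."""
    return {entry["id"] for entry in models_payload.get("data", [])}


def missing_routes(base_url: str = "http://localhost:8000/v1",
                   timeout: float = 10.0) -> Set[str]:
    """Query the server and return the expected routes it does not serve."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return EXPECTED_ROUTES - served_model_ids(payload)
```

An empty set from missing_routes() means both routes are registered; a result containing asr would suggest the LoRA module was not loaded.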

Hugging Face Local Inference

The Hugging Face CLI entrypoint is intended for local validation, debugging, and Windows CUDA use cases:

python hf_omni_inference.py \
  --model-dir /path/to/WorldSeek-Omni-2B-Preview \
  --route omni \
  --text "Give me a short introduction to yourself."

Audio transcription:

python hf_omni_inference.py \
  --model-dir /path/to/WorldSeek-Omni-2B-Preview \
  --route asr \
  --audio-file /path/to/audio.wav

Image understanding:

python hf_omni_inference.py \
  --model-dir /path/to/WorldSeek-Omni-2B-Preview \
  --route omni \
  --image-file /path/to/image.png \
  --text "Describe this image."

Hugging Face FastAPI Serving

The following command starts a local OpenAI-compatible API server at http://localhost:3004/v1:

python hf_omni_server.py \
  --model-dir /path/to/WorldSeek-Omni-2B-Preview \
  --host 0.0.0.0 \
  --port 3004 \
  --device cuda

API Examples

The Chat Completions API is accessible through standard HTTP requests or the OpenAI Python SDK. The following examples use the vLLM server by default:

pip install -U openai
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"

If you are using the Hugging Face FastAPI server, set OPENAI_BASE_URL to http://localhost:3004/v1 instead.

Text Input

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="omni",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)

Image Input

Use model="omni" for image understanding. Images can be passed as URLs or base64 data URLs.

import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

image_path = Path("/path/to/image.png")
image_b64 = base64.b64encode(image_path.read_bytes()).decode("utf-8")

response = client.chat.completions.create(
    model="omni",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_b64}"
                    },
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    temperature=0,
    max_tokens=512,
)

print(response.choices[0].message.content)
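
When images come in several formats, the base64 data URL can be built by a small helper instead of hard-coding image/png. The following is a sketch with a hypothetical function name; it guesses the MIME type from the file extension using the standard library and lets the caller override it.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Optional


def to_data_url(path: str, mime: Optional[str] = None) -> str:
    """Encode a local file as a base64 data URL.

    If mime is not given, it is guessed from the file extension, falling
    back to application/octet-stream when the extension is unknown.
    """
    if mime is None:
        mime, _ = mimetypes.guess_type(path)
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"
```

The result can be used as the url value of an image_url content part. For a remote image, the plain http(s) URL goes in the same field and no encoding is needed.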

Audio Transcription

Use model=asr for speech recognition:

curl http://localhost:8000/v1/audio/transcriptions \
  -F model=asr \
  -F file=@/path/to/audio.wav
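
The same request can be made from Python with the OpenAI SDK's audio transcription call. This is a minimal sketch: transcribe is a hypothetical wrapper, and it assumes the server returns the default JSON transcription response with a text field.

```python
from pathlib import Path


def transcribe(client, audio_path: str, model: str = "asr") -> str:
    """Send one audio file to /v1/audio/transcriptions and return the text.

    client is expected to be an openai.OpenAI instance (or any object
    exposing the same audio.transcriptions.create method).
    """
    with Path(audio_path).open("rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


# Example usage (requires `pip install -U openai` and a running server):
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_BASE_URL and OPENAI_API_KEY
#   print(transcribe(client, "/path/to/audio.wav"))
```

Passing the client in as an argument keeps the helper usable with either the vLLM server or the Hugging Face FastAPI server.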

Audio Chat

Audio can also be sent as a multimodal Chat Completions message. This path also requires model="asr".

import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

audio_path = Path("/path/to/audio.wav")
audio_b64 = base64.b64encode(audio_path.read_bytes()).decode("utf-8")

response = client.chat.completions.create(
    model="asr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": f"data:audio/wav;base64,{audio_b64}"
                    },
                },
                {"type": "text", "text": "Please transcribe this audio."},
            ],
        }
    ],
    temperature=0,
    max_tokens=512,
)

print(response.choices[0].message.content)
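
Video is listed among the supported input types but has no example in this section. The sketch below builds a message using the video_url content part that vLLM's OpenAI-compatible server accepts for video-capable models. Whether this preview checkpoint and the Hugging Face FastAPI server accept it has not been confirmed here, and both video_message and the choice of the omni route are assumptions.

```python
from typing import Any, Dict, List


def video_message(video_url: str, prompt: str) -> List[Dict[str, Any]]:
    """Build a single-turn message list with one video part and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Example usage (assumes the backend supports video input):
#   response = client.chat.completions.create(
#       model="omni",
#       messages=video_message("https://example.com/clip.mp4", "Describe this video."),
#       max_tokens=512,
#   )
```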

Included Scripts

| Script | Purpose |
| --- | --- |
| hf_omni_inference.py | Hugging Face local CLI inference for text, image understanding, and audio transcription. |
| hf_omni_server.py | Lightweight Hugging Face FastAPI server with OpenAI-compatible /v1/chat/completions and /v1/audio/transcriptions endpoints. |
| vllm_route_request_examples.py | vLLM OpenAI-compatible request examples for the omni and asr routes. |

Common script examples:

# Hugging Face local text inference
python hf_omni_inference.py \
  --model-dir /path/to/WorldSeek-Omni-2B-Preview \
  --route omni \
  --text "Give me a short introduction to yourself."

# Hugging Face local audio transcription
python hf_omni_inference.py \
  --model-dir /path/to/WorldSeek-Omni-2B-Preview \
  --route asr \
  --audio-file /path/to/audio.wav

# Hugging Face OpenAI-compatible local server
python hf_omni_server.py \
  --model-dir /path/to/WorldSeek-Omni-2B-Preview \
  --host 0.0.0.0 \
  --port 3004 \
  --device cuda

# vLLM request examples
python vllm_route_request_examples.py \
  --base-url http://localhost:8000/v1 \
  --audio-file /path/to/audio.wav \
  --asr

Limitations

  • This is a preview release, and benchmark results remain subject to update as training and evaluation continue.
  • WorldSeek-Omni-2B-Preview is an early checkpoint; the full WorldSeek-Omni-2B training process is still in progress.
  • Current ASR support focuses on Chinese and English. Support for additional languages will be added in future releases.
  • ASR requests require the asr route.
  • Timestamp prediction and streaming ASR are not included in this preview package.

Acknowledgements

WorldSeek-Omni-2B-Preview is built on Qwen3.5-2B and Qwen3-ASR-1.7B. We thank the Qwen Team and Tongyi Lab for releasing the Qwen model family.

Citation

If you use this model in your research or projects, please cite this model release and the relevant upstream works:

@misc{worldseek_omni_2b_preview,
  title        = {WorldSeek-Omni-2B-Preview},
  author       = {{WorldSeek Team}},
  year         = {2026},
  howpublished = {Preview model release},
  url          = {https://huggingface.co/WorldSeek-AI/WorldSeek-Omni-2B-Preview},
  note         = {Code: https://github.com/WorldSeek-AI/WorldSeek-Omni-2B-Preview; ModelScope: https://modelscope.cn/models/WorldSeek-AI/WorldSeek-Omni-2B-Preview}
}

@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}

@article{Qwen3-ASR,
  title={Qwen3-ASR Technical Report},
  author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.21337},
  year={2026}
}

License

This model is released under the Apache 2.0 License. Please also comply with the licenses and terms of the upstream models and data resources.
