WorldSeek-Omni-2B-Preview
Language: English | 简体中文
This repository contains the model weights, configuration files, and inference scripts for WorldSeek-Omni-2B-Preview.
This is a preview release intended for research, evaluation, and development integration. Complete benchmark results, dataset documentation, and model-card details for the final release will be added in a later update.
WorldSeek-Omni-2B-Preview is an early checkpoint of WorldSeek-Omni-2B. The full training process has not yet been completed, so capabilities and benchmark results should be interpreted as preview-stage behavior rather than final model performance.
WorldSeek-Omni-2B-Preview is an omni model built on Qwen3.5-2B and Qwen3-ASR-1.7B. It preserves the text and vision-language capabilities of Qwen3.5-2B while integrating the AuT audio encoder from Qwen3-ASR-1.7B for speech input and speech recognition tasks.
Resources
- Official Website: https://www.worldseek-ai.com/
- ModelScope: https://modelscope.cn/models/WorldSeek-AI/WorldSeek-Omni-2B-Preview
- Hugging Face: https://huggingface.co/WorldSeek-AI/WorldSeek-Omni-2B-Preview
- GitHub: https://github.com/WorldSeek-AI/WorldSeek-Omni-2B-Preview
Highlights
- Omni input support: Supports text, image, and audio inputs. Video input can follow the Qwen3.5 multimodal message format when supported by the serving backend.
- Qwen3.5-2B backbone: Text, image, and OpenAI-compatible multimodal usage follow the Qwen3.5 model family.
- Qwen3-ASR-1.7B audio tower: The audio encoder is adapted from Qwen3-ASR-1.7B and connected to the Qwen3.5-2B language model.
- Chinese and English ASR: The preview release currently focuses on Chinese and English speech recognition. Quality is still being improved, and multilingual capabilities will be expanded in future releases.
- Route-based inference: omni is used for the base omni path, and asr is used for the ASR path with the ASR LoRA enabled.
- Multiple inference backends: The package includes vLLM OpenAI-compatible serving and standalone Hugging Face local CLI / FastAPI serving scripts.
Architecture Highlights
WorldSeek-Omni-2B-Preview combines the AuT audio encoder from Qwen3-ASR-1.7B (Qwen3-AuT-300M) with the language and vision backbone of Qwen3.5-2B. Audio inputs are first encoded into speech representations by AuT, then adapted into the Qwen3.5-2B language model through the WorldSeek-Omni bridge. This design preserves the text and vision-language behavior of Qwen3.5 while enabling speech input and ASR decoding.
Model Overview
| Feature | Value |
|---|---|
| Type | Omni / Multimodal Causal Language Model |
| Parameters | 2B |
| Language and Vision Backbone | Qwen3.5-2B |
| Audio Encoder | Qwen3-AuT-300M |
| Maximum Sequence Length | 262144 |
| Maximum Audio Input | 60s |
| Supported Input Types | Text, image, video, audio |
| Current ASR Languages | Chinese, English |
| Inference Routes | omni, asr |
| Supported Inference Backends | vLLM (v0.21.0) / Hugging Face Transformers (>=4.57.0) |
| Release Status | Preview / early checkpoint |
| License | Apache 2.0 |
Benchmark Results
The following figures summarize the current preview-stage benchmark results. ASR values report CER / WER where applicable; text, vision, and video benchmarks report accuracy-style scores where higher is better. Since this release is an early checkpoint, these results should not be interpreted as final model performance.
ASR Benchmark
Text, Vision and Video Benchmark
Installation
We recommend using an isolated Python environment. For vLLM serving, install vLLM 0.21.0 or a compatible build, and follow the official vLLM installation guide for a CUDA / PyTorch build that matches your system.
# vLLM serving
pip install "vllm>=0.21.0"
For the vLLM request example script, also install:
pip install requests
For Hugging Face local inference, install the following basic dependencies:
pip install torch "transformers>=4.57.0" safetensors soundfile pillow
For the Hugging Face FastAPI resident server, also install:
pip install fastapi uvicorn python-multipart
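After installing, it can help to confirm that the environment meets the minimum versions named in this section. The following is a minimal sketch using only the standard library; the helper names `version_tuple`, `meets_minimum`, and `report` are illustrative and are not part of this package.

```python
import re
from importlib import metadata

# Minimum versions taken from the installation commands in this section.
MINIMUMS = {"vllm": "0.21.0", "transformers": "4.57.0"}


def version_tuple(version: str) -> tuple:
    """Return the leading numeric release, e.g. '4.58.0.dev0' -> (4, 58, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return (0, 0, 0)
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions such as '4.57' so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric release components only (suffixes are ignored)."""
    return version_tuple(installed) >= version_tuple(minimum)


def report() -> dict:
    """Map each package to True/False, or None when it is not installed."""
    result = {}
    for name, minimum in MINIMUMS.items():
        try:
            result[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result
```

Calling `report()` returns a dictionary keyed by package name. The comparison deliberately ignores suffixes such as `.dev0` or `+cu128`, so a pre-release build is treated like its base version; a stricter check would need a full version parser.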
Quickstart
WorldSeek-Omni-2B-Preview uses two model names to distinguish request routes:
omni -> base omni route
asr -> base model + ASR LoRA route
Speech recognition requests should explicitly use model="asr" or --route asr.
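A client that handles mixed requests may want to pick the model name automatically. The sketch below assumes the rule implied by the examples in this card (audio content goes to asr, everything else goes to omni); the function `select_model_name` is hypothetical and not shipped with the package.

```python
def select_model_name(content) -> str:
    """Return "asr" when the message content carries audio, else "omni".

    `content` is either a plain string or a list of OpenAI-style content
    parts such as {"type": "audio_url", ...}. The routing rule itself is an
    assumption drawn from this card's examples.
    """
    if isinstance(content, str):
        return "omni"
    for part in content:
        if part.get("type") == "audio_url":
            return "asr"
    return "omni"
```

The returned value can be passed directly as the `model` argument of a Chat Completions request.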
vLLM Serving
The following command starts an OpenAI-compatible API server at http://localhost:8000/v1:
MODEL_PATH="/path/to/WorldSeek-Omni-2B-Preview"
vllm serve "${MODEL_PATH}" \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.3 \
--served-model-name "omni" \
--dtype "bfloat16" \
--max-model-len 32768 \
--kv-cache-dtype fp8 \
--limit-mm-per-prompt '{"image":1,"video":1,"audio":1}' \
--trust-remote-code \
--enable-lora \
--lora-modules asr="${MODEL_PATH}" \
--max-lora-rank 256
Check available models:
curl http://localhost:8000/v1/models
Expected model names:
omni
asr
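The same check can be scripted, for example as a startup probe. This is a minimal sketch that assumes the server returns the usual OpenAI-style model list (`{"data": [{"id": ...}, ...]}`); the helper names are illustrative.

```python
import json
from urllib.request import urlopen

EXPECTED_ROUTES = {"omni", "asr"}


def missing_routes(payload: dict) -> set:
    """Return the expected route names that the server does not list."""
    served = {entry["id"] for entry in payload.get("data", [])}
    return EXPECTED_ROUTES - served


def fetch_models(base_url: str = "http://localhost:8000/v1") -> dict:
    """GET {base_url}/models and decode the JSON body."""
    with urlopen(f"{base_url}/models", timeout=10) as resp:
        return json.load(resp)


# Example (requires a running server):
#   missing = missing_routes(fetch_models())
#   print("all routes served" if not missing else f"missing: {sorted(missing)}")
```

If asr is reported missing, the likely cause is that the server was started without `--enable-lora` or `--lora-modules`.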
Hugging Face Local Inference
The Hugging Face CLI entrypoint is intended for local validation, debugging, and Windows CUDA use cases:
python hf_omni_inference.py \
--model-dir /path/to/WorldSeek-Omni-2B-Preview \
--route omni \
--text "Give me a short introduction to yourself."
Audio transcription:
python hf_omni_inference.py \
--model-dir /path/to/WorldSeek-Omni-2B-Preview \
--route asr \
--audio-file /path/to/audio.wav
Image understanding:
python hf_omni_inference.py \
--model-dir /path/to/WorldSeek-Omni-2B-Preview \
--route omni \
--image-file /path/to/image.png \
--text "Describe this image."
Hugging Face FastAPI Serving
The following command starts a local OpenAI-compatible API server at http://localhost:3004/v1:
python hf_omni_server.py \
--model-dir /path/to/WorldSeek-Omni-2B-Preview \
--host 0.0.0.0 \
--port 3004 \
--device cuda
API Examples
The Chat Completions API is accessible through standard HTTP requests or the OpenAI Python SDK. The following examples use the vLLM server by default:
pip install -U openai
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="EMPTY"
If you are using the Hugging Face FastAPI server, set OPENAI_BASE_URL to http://localhost:3004/v1 instead.
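For environments where the OpenAI SDK is not available, a plain HTTP request can be built with the standard library. The sketch below assumes the standard Chat Completions request shape; `build_chat_request` is an illustrative helper, not part of the package.

```python
import json
from urllib.request import Request


def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "EMPTY") -> Request:
    """Build a POST request for {base_url}/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 512,
    }
    return Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Example (requires a running server):
#   from urllib.request import urlopen
#   req = build_chat_request("http://localhost:8000/v1", "omni", "Hello")
#   with urlopen(req, timeout=60) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```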
Text Input
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="omni",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
Image Input
Use model="omni" for image understanding. Images can be passed as URLs or base64 data URLs.
import base64
from pathlib import Path
from openai import OpenAI
client = OpenAI()
image_path = Path("/path/to/image.png")
image_b64 = base64.b64encode(image_path.read_bytes()).decode("utf-8")
response = client.chat.completions.create(
    model="omni",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_b64}"
                    },
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    temperature=0,
    max_tokens=512,
)
print(response.choices[0].message.content)
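Because images may be given either as URLs or as base64 data URLs, a small helper can accept both forms and infer the MIME type from the file extension instead of hard-coding `image/png`. This is a sketch under that assumption; `image_part` is a hypothetical name.

```python
import base64
import mimetypes
from pathlib import Path


def image_part(source: str) -> dict:
    """Build an "image_url" content part from a URL or a local file path."""
    if source.startswith(("http://", "https://", "data:")):
        url = source  # already a URL or data URL; pass through unchanged
    else:
        path = Path(source)
        mime, _ = mimetypes.guess_type(path.name)
        if mime is None or not mime.startswith("image/"):
            raise ValueError(f"cannot infer an image MIME type for {path.name!r}")
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        url = f"data:{mime};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url}}
```

The returned dictionary can be placed in the `content` list of a user message alongside a text part.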
Audio Transcription
Use model=asr for speech recognition:
curl http://localhost:8000/v1/audio/transcriptions \
-F model=asr \
-F file=@/path/to/audio.wav
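The Model Overview lists a maximum audio input of 60s, so it can be useful to check a clip's duration before sending it. The sketch below uses the standard-library `wave` module, which reads PCM WAV files only. This card does not state how the server treats longer clips, so fixed-size splitting is an assumption here, and it may cut speech mid-word.

```python
import wave

MAX_AUDIO_SECONDS = 60.0  # "Maximum Audio Input" from the Model Overview table


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def chunk_frame_ranges(n_frames: int, framerate: int,
                       max_seconds: float = MAX_AUDIO_SECONDS) -> list:
    """Split [0, n_frames) into consecutive ranges of at most max_seconds each."""
    step = int(framerate * max_seconds)
    return [(start, min(start + step, n_frames))
            for start in range(0, n_frames, step)]
```

Each `(start, end)` pair gives frame offsets that can be read with `wave.Wave_read.setpos` and `readframes` and written out as a separate file for transcription.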
Audio Chat
Audio can also be sent as a multimodal Chat Completions message. This path also requires model="asr".
import base64
from pathlib import Path
from openai import OpenAI
client = OpenAI()
audio_path = Path("/path/to/audio.wav")
audio_b64 = base64.b64encode(audio_path.read_bytes()).decode("utf-8")
response = client.chat.completions.create(
    model="asr",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": f"data:audio/wav;base64,{audio_b64}"
                    },
                },
                {"type": "text", "text": "Please transcribe this audio."},
            ],
        }
    ],
    temperature=0,
    max_tokens=512,
)
print(response.choices[0].message.content)
Included Scripts
| Script | Purpose |
|---|---|
| hf_omni_inference.py | Hugging Face local CLI inference for text, image understanding, and audio transcription. |
| hf_omni_server.py | Lightweight Hugging Face FastAPI server with OpenAI-compatible /v1/chat/completions and /v1/audio/transcriptions endpoints. |
| vllm_route_request_examples.py | vLLM OpenAI-compatible request examples for the omni and asr routes. |
Common script examples:
# Hugging Face local text inference
python hf_omni_inference.py \
--model-dir /path/to/WorldSeek-Omni-2B-Preview \
--route omni \
--text "Give me a short introduction to yourself."
# Hugging Face local audio transcription
python hf_omni_inference.py \
--model-dir /path/to/WorldSeek-Omni-2B-Preview \
--route asr \
--audio-file /path/to/audio.wav
# Hugging Face OpenAI-compatible local server
python hf_omni_server.py \
--model-dir /path/to/WorldSeek-Omni-2B-Preview \
--host 0.0.0.0 \
--port 3004 \
--device cuda
# vLLM request examples
python vllm_route_request_examples.py \
--base-url http://localhost:8000/v1 \
--audio-file /path/to/audio.wav \
--asr
Limitations
- This is a preview release, and benchmark results remain subject to update as training and evaluation continue.
- WorldSeek-Omni-2B-Preview is an early checkpoint; the full WorldSeek-Omni-2B training process is still in progress.
- Current ASR support focuses on Chinese and English. Support for additional languages will be added in future releases.
- ASR requests require the asr route.
- Timestamp prediction and streaming ASR are not included in this preview package.
Acknowledgements
WorldSeek-Omni-2B-Preview is built on Qwen3.5-2B and Qwen3-ASR-1.7B. We thank the Qwen Team and Tongyi Lab for releasing the Qwen model family.
Citation
If you use this model in your research or projects, please cite this model release and the relevant upstream works:
@misc{worldseek_omni_2b_preview,
title = {WorldSeek-Omni-2B-Preview},
author = {{WorldSeek Team}},
year = {2026},
howpublished = {Preview model release},
url = {https://huggingface.co/WorldSeek-AI/WorldSeek-Omni-2B-Preview},
note = {Code: https://github.com/WorldSeek-AI/WorldSeek-Omni-2B-Preview; ModelScope: https://modelscope.cn/models/WorldSeek-AI/WorldSeek-Omni-2B-Preview}
}
@misc{qwen3.5,
title = {{Qwen3.5}: Towards Native Multimodal Agents},
author = {{Qwen Team}},
month = {February},
year = {2026},
url = {https://qwen.ai/blog?id=qwen3.5}
}
@article{Qwen3-ASR,
title={Qwen3-ASR Technical Report},
author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
journal={arXiv preprint arXiv:2601.21337},
year={2026}
}
License
This model is released under the Apache 2.0 License. Please also comply with the licenses and terms of the upstream models and data resources.
