Qwen 3 VL 8B Thinking
Model Description
Qwen 3 VL 8B Thinking is an 8-billion-parameter vision-language model in the Qwen 3 family developed by Alibaba Cloud. It pairs image and language understanding with an extended "thinking" (step-by-step reasoning) mode for complex visual question answering and other multimodal tasks.
Key Capabilities:
- 🖼️ Vision-Language Understanding: Process and understand images with natural language
- 🧠 Reasoning Capabilities: Extended thinking process for complex visual reasoning
- 💬 Multimodal Chat: Interactive conversations about images and visual content
- 🎯 Visual Question Answering: Answer questions about image content with detailed reasoning
- 📊 Scene Understanding: Comprehensive analysis of visual scenes and contexts
Repository Contents
⚠️ Note: This directory is currently being prepared for model files.
Expected model structure:
qwen3-vl-8b-thinking/
├── config.json # Model configuration
├── model.safetensors # Main model weights (~16GB)
├── tokenizer.json # Tokenizer configuration
├── tokenizer_config.json # Tokenizer settings
├── special_tokens_map.json # Special tokens mapping
├── preprocessor_config.json # Image preprocessor config
├── generation_config.json # Generation parameters
└── README.md # This file
Expected Total Size: ~16-20 GB (FP16 precision)
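Before loading, it can help to confirm that a local directory matches the structure listed above. The following is a minimal sketch, assuming the single-file `model.safetensors` layout shown; `missing_model_files` is a hypothetical helper, and a sharded checkpoint would need a different check.

```python
from pathlib import Path

# File names taken from the expected structure above (README.md omitted).
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "preprocessor_config.json",
    "generation_config.json",
]


def missing_model_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list from `missing_model_files("E:/huggingface/qwen3-vl-8b-thinking")` would indicate that every listed file is present.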
Hardware Requirements
Minimum Requirements
- VRAM: 20GB+ (RTX 4090, A5000, or better)
- System RAM: 32GB recommended
- Disk Space: 25GB free space
- CUDA: 11.8 or higher recommended
Recommended Requirements
- VRAM: 24GB+ (RTX 4090, A6000, A100)
- System RAM: 64GB for optimal performance
- Disk Space: 50GB for model + cache
- CUDA: 12.0+ for best performance
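The two requirement lists above can be turned into a quick self-check. This is only a sketch: `requirement_tier` is a hypothetical helper, the thresholds are copied from the lists, and treating them as hard cut-offs is an assumption (quantized loading needs less VRAM).

```python
# Thresholds in GB, copied from the minimum and recommended lists above.
MINIMUM = {"vram": 20, "ram": 32, "disk": 25}
RECOMMENDED = {"vram": 24, "ram": 64, "disk": 50}


def requirement_tier(vram_gb, ram_gb, disk_gb):
    """Classify a machine as 'recommended', 'minimum', or 'below minimum'."""
    have = {"vram": vram_gb, "ram": ram_gb, "disk": disk_gb}
    if all(have[key] >= RECOMMENDED[key] for key in RECOMMENDED):
        return "recommended"
    if all(have[key] >= MINIMUM[key] for key in MINIMUM):
        return "minimum"
    return "below minimum"
```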
Performance Estimates
- FP16: ~20GB VRAM, fastest inference
- 8-bit quantization: ~10GB VRAM, good quality
- 4-bit quantization: ~6GB VRAM, acceptable quality
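These estimates follow roughly from parameter count times bytes per parameter, plus runtime overhead. The sketch below shows that arithmetic and a simple precision picker; both function names are hypothetical, the thresholds are the figures listed above, and the weight estimate deliberately ignores activations and the KV cache.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


def pick_precision(free_vram_gb):
    """Pick the highest precision whose listed VRAM estimate fits."""
    estimates = [("fp16", 20), ("int8", 10), ("int4", 6)]  # GB, from the list above
    for name, needed_gb in estimates:
        if free_vram_gb >= needed_gb:
            return name
    return None  # nothing in the list fits
```

For 8 billion parameters, `weight_memory_gb(8e9, 16)` gives 16.0, which matches the ~16GB weight file; the gap to the ~20GB FP16 estimate is runtime overhead.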
Usage Examples
Basic Usage with Transformers
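from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch
# Load model and processor from a local directory
model_path = "E:/huggingface/qwen3-vl-8b-thinking"
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)
# Load the image and wrap the prompt in a chat message; the chat template
# inserts the image placeholder tokens the model expects
image = Image.open("example.jpg")
prompt = "Describe this image in detail and explain what's happening."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]
# Prepare inputs
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
# Generate response with thinking
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens, not the echoed prompt
generated = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(generated, skip_special_tokens=True)
print(response)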
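Because this is a thinking variant, the decoded text usually contains the reasoning trace followed by the final answer. The sketch below separates the two, assuming the reasoning ends with a closing `</think>` marker that survives decoding; that depends on the tokenizer configuration, and `split_thinking` is a hypothetical helper.

```python
def split_thinking(text, close_tag="</think>"):
    """Split decoded output into (reasoning, answer).

    Assumes the reasoning ends with close_tag. If the tag is absent,
    the whole text is treated as the answer.
    """
    reasoning, sep, answer = text.rpartition(close_tag)
    if not sep:
        return "", text.strip()
    # The opening tag may or may not be present, depending on the chat template.
    reasoning = reasoning.replace("<think>", "", 1)
    return reasoning.strip(), answer.strip()
```

With the example above, `reasoning, answer = split_thinking(response)` would leave only the final description in `answer`.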
Visual Question Answering
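# Ask specific questions about an image (reuses model, processor, and image from the basic example)
questions = [
    "What objects are visible in this image?",
    "What is the main activity taking place?",
    "What might happen next in this scene?",
]
for question in questions:
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": question},
    ]}]
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Decode only the generated answer, not the echoed prompt
    answer = processor.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(f"Q: {question}")
    print(f"A: {answer}\n")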
Batch Processing Multiple Images
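from pathlib import Path
# Collect paths once so images and file names stay aligned
image_dir = Path("images/")
image_paths = sorted(image_dir.glob("*.jpg"))
images = [Image.open(path) for path in image_paths]
# One single-turn conversation per image
conversations = [
    [{"role": "user", "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": "Analyze this image:"},
    ]}]
    for img in images
]
# Batch processing; left padding keeps generated tokens aligned across the batch
processor.tokenizer.padding_side = "left"
inputs = processor.apply_chat_template(
    conversations,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the generated tokens for each sequence
prompt_len = inputs["input_ids"].shape[1]
responses = processor.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)
for img_path, response in zip(image_paths, responses):
    print(f"\n{img_path.name}:")
    print(response)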
Memory-Efficient Loading (8-bit)
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
# 8-bit quantization for lower VRAM usage (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
)
# Use as normal - ~50% VRAM reduction
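The performance estimates also list 4-bit quantization. Here is a sketch of the corresponding 4-bit setup, assuming `bitsandbytes` is installed and that `AutoModelForImageTextToText` resolves this checkpoint. `load_4bit` is a hypothetical wrapper, and the NF4 settings are common defaults rather than values from the model authors.

```python
def load_4bit(model_path):
    """Load the model with 4-bit NF4 weights via bitsandbytes."""
    # Imported lazily so the helper can be defined without the GPU stack installed.
    import torch
    from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
        bnb_4bit_compute_dtype=torch.float16,  # dtype used for matrix multiplies
    )
    return AutoModelForImageTextToText.from_pretrained(
        model_path,
        quantization_config=quantization_config,
        device_map="auto",
    )
```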
Model Specifications
Architecture
- Base Architecture: Qwen 3 Vision-Language Transformer
- Parameters: 8 billion
- Vision Encoder: High-resolution vision transformer
- Language Model: Qwen 3 8B language backbone
- Context Length: Up to 256K tokens (native)
- Image Resolution: Dynamic resolution support (up to 1024x1024)
Precision and Format
- Default Precision: FP16 (Float16)
- Format: SafeTensors (secure, efficient)
- Quantization Support: 8-bit, 4-bit via bitsandbytes
- Framework: PyTorch with Transformers
Training Details
- Base Model: Qwen 3 VL 8B
- Special Training: Extended reasoning/thinking capabilities
- Multimodal Alignment: Vision-language co-training
- Optimization: Instruction-tuned for visual understanding
Performance Tips
Optimization Recommendations
- Use Flash Attention 2 (if available; requires the flash-attn package):
from transformers import AutoModelForImageTextToText
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
- Enable Compilation (PyTorch 2.0+):
model = torch.compile(model, mode="reduce-overhead")
- Optimize Image Preprocessing:
# Resize large images before processing
from PIL import Image
def preprocess_image(img_path, max_size=1024):
    img = Image.open(img_path)
    if max(img.size) > max_size:
        # thumbnail() resizes in place and preserves the aspect ratio
        img.thumbnail((max_size, max_size), Image.Resampling.LANCZOS)
    return img
- Batch Similar-Sized Images: Group images by size for efficient batch processing
- Use Lower Precision for Inference: FP16 or BF16 for speed, 8-bit for VRAM constraints
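The advice to batch similar-sized images can be sketched as a grouping step that runs before the processor is called. `group_by_size` is a hypothetical helper, and grouping on exact dimensions is an assumption; bucketing by rounded size may suit mixed datasets better.

```python
from collections import defaultdict


def group_by_size(image_sizes):
    """Group image names by (width, height) so each batch has uniform dimensions.

    image_sizes maps a name to a (width, height) tuple, e.g. from PIL's Image.size.
    """
    groups = defaultdict(list)
    for name, size in image_sizes.items():
        groups[tuple(size)].append(name)
    return dict(groups)
```

Each value in the returned dictionary can then be passed through the batch-processing code as one batch.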
Memory Management
import torch
import gc
# Clear cache between batches
def clear_memory():
    gc.collect()
    torch.cuda.empty_cache()
# Use after processing batches
clear_memory()
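One way to apply this between batches is a small driver loop that slices the work and runs a cleanup callback after each slice. This is a sketch only: `process_in_batches`, `run_batch`, and `cleanup` are hypothetical names, and `run_batch` stands in for whatever function wraps the processor and `model.generate` calls.

```python
def process_in_batches(items, batch_size, run_batch, cleanup=None):
    """Run run_batch on consecutive slices of items, calling cleanup after each slice."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(items), batch_size):
        results.extend(run_batch(items[start:start + batch_size]))
        if cleanup is not None:
            cleanup()  # e.g. a function that calls gc.collect() and torch.cuda.empty_cache()
    return results
```

Passing the cache-clearing function as `cleanup` keeps peak memory bounded by one batch at a time.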
License
This model is released under the Apache 2.0 License.
You are free to:
- ✅ Use commercially
- ✅ Modify and distribute
- ✅ Use privately
- ✅ Use under the license's express patent grant
Conditions:
- 📄 Include license and copyright notice
- 📝 State changes made to the code
- 🔓 Retain existing copyright, patent, trademark, and attribution notices
See the Apache 2.0 License for full terms.
Citation
If you use this model in your research or applications, please cite:
@misc{qwen3-vl-8b-thinking,
  title={Qwen 3 VL 8B Thinking: Vision-Language Model with Reasoning},
  author={{Qwen Team, Alibaba Cloud}},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking}}
}
Resources and Links
- 🏠 Official Website: https://qwenlm.github.io/
- 📚 Documentation: https://huggingface.co/docs/transformers/main/en/model_doc/qwen3_vl
- 💬 Community: https://huggingface.co/Qwen
- 🐛 Issues: Report issues on the official Qwen GitHub repository
- 📖 Paper: Check Qwen technical reports for architecture details
Supported Tasks
- Visual Question Answering (VQA): Answer questions about image content
- Image Captioning: Generate detailed descriptions of images
- Visual Reasoning: Complex reasoning about visual scenes
- Multimodal Chat: Interactive conversations with image context
- Scene Understanding: Comprehensive analysis of visual contexts
- Object Recognition: Identify and describe objects in images
Model Limitations
- Image resolution limits may affect fine detail recognition
- Performance varies based on image quality and clarity
- May require fine-tuning for domain-specific applications
- Reasoning capabilities depend on prompt quality and structure
- Computational requirements may limit deployment scenarios
Safety and Responsible Use
- Review outputs for accuracy, especially in critical applications
- Be aware of potential biases in visual understanding
- Validate model responses for factual correctness
- Use appropriate safety filters for production deployments
- Consider privacy implications when processing images
Version: 1.0
Last Updated: 2025-11-05
Model Type: Vision-Language Multimodal
Status: Ready for local deployment