what keys to run with vllm?

#1
by IIOXMEJI - opened

what keys to run with vllm?
recommended for qwen3.6?

Hey @IlOXMEJI! Thanks for trying the model!

This particular repo is the MLX version -- designed for Apple Silicon (Mac) inference using mlx_lm. It's not directly compatible with vLLM.

For vLLM / GPU inference, you'll want the GGUF version instead:
https://huggingface.co/deadbydawn101/RavenX-CyberAgent-Qwen3.6-35B-A3B-Opus-4.7-OpenMythos-Pentester-BugHunter-RATH-GGUF

To run on Apple Silicon (this repo):

pip install mlx-lm
python -m mlx_lm chat --model deadbydawn101/RavenX-CyberAgent-Qwen3.6-35B-A3B-Opus-4.7-OpenMythos-Pentester-BugHunter-RATH-mlx

To run with llama.cpp / Ollama (GGUF repo):

llama-server -m RavenX-CyberAgent-*-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 8192 --chat-template chatml

For vLLM specifically, you'd need to convert the MLX model to HuggingFace format first using https://github.com/pccr10001/mlx-to-hf-qwen35moe -- community member @pccr10001 built this converter specifically for Qwen3.5-MoE MLX models.

Let us know how it goes.

Sign up or log in to comment