Running on 2 RTX Pro 6000 Blackwell GPUs at ~30 tps (Instructions that worked for me)
Prereqs
- CUDA 13.2 toolkit installed at /usr/local/cuda-13.x
- GCC 11+ (12 preferred), Python 3.12, recent pip
- At least 64GB RAM and 100GB free disk for the build (Model weights are an additional ~140 GB)
- 2 RTX PRO 6000 Blackwell GPUs
# Clean environment
python3.12 -m venv ~/vllm-build-env
source ~/vllm-build-env/bin/activate
pip install --upgrade pip wheel setuptools
# Get vLLM source
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Pin to a known-good commit
git checkout c3ad791e1 # from your version string 0.20.1rc1.dev152+gc3ad791e1
# Install build-time torch matching your runtime
pip install torch==2.11.0 torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu130
# Build environment
export CUDA_HOME=/usr/local/cuda-13.0 # or wherever yours lives
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# THE KEY FLAG — target SM120f specifically
# Use 12.0a + 12.0f together so kernels with arch-specific (a) and family-specific (f)
# variants both get compiled. Drop other archs to make the build faster and the wheel smaller.
export TORCH_CUDA_ARCH_LIST="12.0a;12.0f"
# Optional but recommended: limit parallel jobs so you don't OOM on the host
export MAX_JOBS=8
export NVCC_THREADS=2
# Build the wheel
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
python setup.py bdist_wheel
# Result: dist/vllm-0.20.1rc1.dev152+gc3ad791e1-cp312-cp312-linux_x86_64.whl
# Test it locally first
pip install dist/vllm-*.whl
python -c "import vllm; print(vllm.__version__)"
My launch script:
#!/usr/bin/env bash
source ~/vllm-env/bin/activate
TORCH_CUDA_ARCH_LIST="12.0f" \
CUDA_HOME=/usr/local/cuda-13.2 \
vllm serve /wherever/your/models/exist/Mistral-Medium-3.5-128B \
--host 127.0.0.1 \
--port 5001 \
--served-model-name mistral-medium \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--max-num-seqs 1 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8 \
--load-format mistral \
--tokenizer-mode mistral \
--config-format mistral \
--tool-call-parser mistral \
--enable-auto-tool-choice \
--reasoning-parser mistral \
--limit-mm-per-prompt '{"image": 4}' \
--speculative-config '{"model": "/wherever/your/models/exist/Mistral-Medium-3.5-128B-EAGLE", "num_speculative_tokens": 1, "method": "eagle", "draft_tensor_parallel_size": 2}'
how is it? how would you rate it compared to other models that work on 2x rtx 6000 pros?
Also, how is the KV Cache? With 196G of VRAM, and 140GB for the weights, I am curious as to how many parallel calls you are getting with this being a dense model. I have 4X H100 GPUs, currently serving an image gen model, docling, Nemotron 3 Nano Omni, Nemotron 3 Super and Gemma 4 26B (Most of these I have switched to NVFP4 to save on VRAM).
However, it might be worth it to to ditch it all, and run on 4xH100 GPUs if it can match or outperform Claude Sonnet 4.5 . I have emailed mistral about that "enterprise license" though. I am still waiting on a reply.
docker run -d --gpus all -p 8000:8000 --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /mnt/models:/models -e OMP_NUM_THREADS=32 --entrypoint bash vllm/vllm-openai:nightly -c 'apt update; apt install -y git; pip install git+https://github.com/mistralai/mistral-common.git git+https://github.com/huggingface/transformers.git && exec python3 -m vllm.entrypoints.openai.api_server --model /models/mistral-medium-3.5 --tensor-parallel-size 2 --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral --tokenizer-mode mistral --config-format mistral --load-format mistral --served-model-name mistral-medium --gpu_memory_utilization 0.93 --kv-cache-dtype fp8_per_token_head --max-num-seqs 2 --max-num-batched-tokens 8192 --enable-log-requests --max-log-len 65536 --default-chat-template-kwargs '\''{"reasoning_effort":"high"}'\'' --override-generation-config '\''{"temperature":0.7}'\'''
This is what I have been using, main diff is using fp8_per_token_head instead of fp8 which seems to not hit the model as hard as basic fp8
How about the eagle model?
how is it? how would you rate it compared to other models that work on 2x rtx 6000 pros?
It really depends on what use case you want it for. I will say I have been able to use it for some automated coding to reasonably okay effect. It needs substantial handholding though.
If you want it for RP it is pretty loose on the safeguards (there are some extreme lines it will not cross). It will get into a repetition loop if you let it impersonate you, but it is really good at keeping context out to to the token window so that is neat.
If I compare it to Claude Sonnet 4.5 it is not quite as good, but it is impressive for a 120B model.
Also, how is the KV Cache? With 196G of VRAM, and 140GB for the weights, I am curious as to how many parallel calls you are getting with this being a dense model. I have 4X H100 GPUs, currently serving an image gen model, docling, Nemotron 3 Nano Omni, Nemotron 3 Super and Gemma 4 26B (Most of these I have switched to NVFP4 to save on VRAM).
However, it might be worth it to to ditch it all, and run on 4xH100 GPUs if it can match or outperform Claude Sonnet 4.5 . I have emailed mistral about that "enterprise license" though. I am still waiting on a reply.
Having tested a lot of the locally runnable models, Nemotron 3 Super is faster, but not as good at most tasks, Gemma4 31B (I am not very impressed with the 26B) is pretty good but it hallucinates way too much when I give it a lot of options. I actually think a Mistral Orchestrator and Gemma4 subagent (with very well defined tasks) would be an excellent setup for something local. (This is sadly outside of my memory limit.) It isn't quite at Sonnet level.
In my experience, DS4 Flash is faster for a similar experience, but Mistral 3.5 Medium is actually better at retaining its context and doesn't get lost as easily on long context tasks.
Gemma4 31B is a very competent model for simple tasks and is very good at doing recall tasks, but it does get a little too creative, I think. It gets incredibly annoying to have to fall back and fix it's problems, something I have to do about 1/10th of the time with Mistral.
Nemotron 3 Super is not as good of a model as DS4 Flash or Mistral 3.5, but it handles longer context better than DS4, but not as good as M3.5. It is faster than Mistral by quite a bit though... so there could be a use case for it.
Whats the speed (token/s) with your configuration?
Can you try increasing "num_speculative_tokens": 1 and post your results please so we don't have to test this :)
| Workload | Generation throughput | Mean acceptance length | Avg draft acceptance rate | Per-position acceptance |
|---|---|---|---|---|
| General prose | 26–35 tok/s | 1.89–1.98 | 45–49% | ~60% / ~32% |
| Coding | 37–43 tok/s | 2.34–2.70 | 67–85% | 74–91% / 59–80% |
I don't think Spec Decode 3 is useful for storytelling, but it could be useful for coding.
Why wouldnt it be useful to predict the next 3 tokens instead of the next token?
Also running through the small spec decoding model will have a small amount of overhead. You will probably see much larger benefit with higher context.
But maybe you took this into consideration.
I did take this into consideration. The table is a result for Spec Decode 2, that is why there are 2 positions on the per-position acceptance. That's on me for not being clear. Spec Decode 1 was neat because it actually worked, but I think 2 is probably the sweet spot unless you are all in on coding. You get a slight drop in performance with Spec Decode 3 on storytelling, and I haven't tested it too extensively for coding, but it would probably be a gain based on the trend. I would expect to see something like 20%-50% acceptance on the third token.
I guess what I am actually saying is:
Spec Decode 2 is good for storytelling
Spec Decode 3 might be worth considering for coding.
I don't really see any benefit at longer contexts. I actually see Spec Decode acceptance rate drop slightly over the context, and it is probably because I am adding to the context as a human being and I am not the same as whatever it is trained on.