SeerAttention-R
How to use SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates")
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates")
model = AutoModelForCausalLM.from_pretrained("SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates")
How to use SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
docker model run hf.co/SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates
How to use SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
How to use SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates with Docker Model Runner:
docker model run hf.co/SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates
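The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and it assumes a server started with one of the commands above is listening on the given port and returns the usual OpenAI-style completions response.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response shape: {"choices": [{"text": ...}, ...]}
    return body["choices"][0]["text"]
```

For example, `complete("http://localhost:30000", "SeerAttention/SeerAttention-Decode-Qwen3-14B-AttnGates", "Once upon a time,")` mirrors the SGLang curl call; use port 8000 for the vLLM server.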
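This repo contains the decode-stage AttnGate weights from the SeerAttention-R paper. The currently supported models, as listed in the results tables below, are Qwen3-4B, Qwen3-8B, Qwen3-14B, and DeepSeek-R1-Distill-Qwen-14B.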
Results on reasoning tasks under different token budgets. All results are averaged pass@1, with 64 samples per query for AIME, 16 samples for GPQA, and 8 samples for MATH-500.
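To make the metric concrete, here is a small sketch of how an averaged pass@1 score can be computed from sampled answers. It assumes each query's samples have already been graded as correct or incorrect; the function name `averaged_pass_at_1` and the toy data are illustrative and not taken from the paper's evaluation code.

```python
def averaged_pass_at_1(graded_samples):
    """Average pass@1 in percent.

    graded_samples: one list per query, holding one boolean per sampled answer
    (True if that sample was graded correct).
    """
    # Per-query pass@1 is the fraction of that query's samples that are correct.
    per_query = [sum(samples) / len(samples) for samples in graded_samples]
    # The reported score is the mean over queries, expressed as a percentage.
    return 100.0 * sum(per_query) / len(per_query)


# Toy example: 2 queries with 4 samples each (the tables use 64, 16, or 8 samples).
toy = [
    [True, True, False, True],    # 3 of 4 correct -> 0.75
    [False, False, True, False],  # 1 of 4 correct -> 0.25
]
print(averaged_pass_at_1(toy))  # 50.0
```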
| Model | 2k | 4k | 6k | 8k | Full Attention |
|---|---|---|---|---|---|
| Qwen3-4B | 55.42 | 68.75 | 70.94 | 72.50 | 71.25 |
| Qwen3-8B | 56.56 | 72.29 | 74.22 | 75.05 | 74.48 |
| Qwen3-14B | 62.24 | 75.78 | 78.02 | 78.65 | 78.91 |
| DeepSeek-R1-Distill-Qwen-14B | 55.78 | 66.35 | 67.50 | 66.82 | 67.50 |

| Model | 2k | 4k | 6k | 8k | Full Attention |
|---|---|---|---|---|---|
| Qwen3-4B | 45.73 | 57.60 | 60.20 | 62.90 | 66.41 |
| Qwen3-8B | 42.60 | 56.77 | 60.31 | 64.17 | 67.86 |
| Qwen3-14B | 46.67 | 62.66 | 67.19 | 69.01 | 70.21 |
| DeepSeek-R1-Distill-Qwen-14B | 38.44 | 47.19 | 52.25 | 50.05 | 50.00 |

| Model | 1k | 2k | 4k | 6k | Full Attention |
|---|---|---|---|---|---|
| Qwen3-4B | 84.80 | 92.20 | 93.60 | 93.60 | 93.93 |
| Qwen3-8B | 82.82 | 91.53 | 94.17 | 94.53 | 94.43 |
| Qwen3-14B | 85.13 | 93.20 | 94.77 | 94.80 | 95.22 |
| DeepSeek-R1-Distill-Qwen-14B | 87.65 | 92.10 | 93.05 | 93.12 | 93.30 |

| Model | 1k | 2k | 4k | 6k | Full Attention |
|---|---|---|---|---|---|
| Qwen3-4B | 39.61 | 51.20 | 55.20 | 55.90 | 56.19 |
| Qwen3-8B | 37.59 | 54.32 | 59.60 | 60.48 | 60.54 |
| Qwen3-14B | 44.54 | 59.72 | 63.76 | 64.20 | 65.25 |
| DeepSeek-R1-Distill-Qwen-14B | 51.26 | 56.79 | 56.41 | 57.48 | 57.80 |
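One way to read these tables is as the gap, in accuracy points, between each token budget and full attention. The sketch below does that arithmetic for the Qwen3-14B row of the first table; the helper name `gap_to_full_attention` is hypothetical, and the numbers are copied from that row.

```python
def gap_to_full_attention(budget_scores, full_score):
    """Return each budget's score minus the full-attention score, in points."""
    return {
        budget: round(score - full_score, 2)
        for budget, score in budget_scores.items()
    }


# Qwen3-14B row of the first table above.
qwen3_14b = {"2k": 62.24, "4k": 75.78, "6k": 78.02, "8k": 78.65}
gaps = gap_to_full_attention(qwen3_14b, full_score=78.91)
print(gaps)  # {'2k': -16.67, '4k': -3.13, '6k': -0.89, '8k': -0.26}
```

For this row the gap shrinks from about 16.7 points at a 2k budget to under 0.3 points at 8k.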