---
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-generation
tags:
- speculative-decoding
- eagle
- qwen-vl
base_model:
- Qwen/Qwen3-VL-2B-Instruct
---

<!-- 语言切换 / Language Toggle -->
<div align="center">
<a href="#-en">English</a> | <a href="#-zh-cn">中文</a>
</div>

<!-- 英文版 README -->
<div id="-en">

# EAGLE-3 Draft Model for Qwen3-VL-2B-Instruct

## Model Overview

This repository contains an **EAGLE-3 style draft model** trained to accelerate inference for the `Qwen3-VL-2B-Instruct` vision-language model.

This is **not a standalone model**. It must be used in conjunction with its corresponding base model (`Qwen3-VL-2B-Instruct`) within a speculative decoding framework to achieve significant speedups in text generation.

- **Base Model:** `Qwen3-VL-2B-Instruct`
- **Model Architecture:** EAGLE-3 (Speculative Decoding Draft Model)
- **Primary Benefit:** Accelerates text generation throughput by 1.5x to 2.5x without compromising the generation quality of the base model.

## What is EAGLE?

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a speculative decoding method. A small draft model cheaply proposes a short sequence of draft tokens, which the larger base model then verifies in a single forward pass. Each accepted draft token lets generation advance by more than one token per base-model pass, which is where the speedup comes from.

This model serves as the "draft model" in this process. Its average acceptance length (`acc_length`) across the benchmarks listed below is approximately **1.93 tokens** (with 4 draft tokens), meaning that on average it helps the base model advance nearly 2 tokens per verification step.
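The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the EAGLE-3 implementation: `base_next` and `draft_next` are hypothetical toy functions standing in for the two models, decoding is greedy, and the draft is a single chain of tokens (the feature reuse and tree-shaped drafts of EAGLE-style methods are left out). The sketch records how many tokens each verification step advances; whether the reported `acc_length` also counts the extra token contributed by the base model is not stated in this card.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def base_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the base model (deterministic toy rule)."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the draft model: usually agrees with the base."""
    guess = base_next(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


def greedy_generate(prompt: List[Token], base: NextToken, max_new_tokens: int) -> List[Token]:
    """Plain decoding: one base-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(base(out))
    return out


def speculative_generate(
    prompt: List[Token],
    base: NextToken,
    draft: NextToken,
    num_draft_tokens: int,
    max_new_tokens: int,
) -> Tuple[List[Token], List[int]]:
    """Greedy speculative decoding with a chain-shaped draft."""
    out = list(prompt)
    advanced_per_step: List[int] = []
    while len(out) - len(prompt) < max_new_tokens:
        # 1) The draft model proposes tokens one after another.
        proposal: List[Token] = []
        for _ in range(num_draft_tokens):
            proposal.append(draft(out + proposal))

        # 2) The base model checks every position. A real system scores all
        #    positions in one batched forward pass; the toy calls it per position.
        accepted: List[Token] = []
        for tok in proposal:
            target = base(out + accepted)
            if tok != target:
                accepted.append(target)  # first mismatch: keep the base token, stop
                break
            accepted.append(tok)
        else:
            accepted.append(base(out + accepted))  # all drafts accepted: bonus token

        out.extend(accepted)
        advanced_per_step.append(len(accepted))
    return out[: len(prompt) + max_new_tokens], advanced_per_step


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, steps = speculative_generate(prompt, base_next, draft_next, 4, 20)
    print("identical to plain greedy decoding:", tokens == greedy_generate(prompt, base_next, 20))
    print("mean tokens advanced per verification step:", sum(steps) / len(steps))
```

Because a rejected draft token is replaced by the base model's own choice, the output matches plain greedy decoding token for token; the draft model only changes how many base-model passes are needed.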

## Performance

This model was evaluated on a diverse set of benchmarks. The `acc_length` (average number of accepted draft tokens) indicates the efficiency of the acceleration. A higher value is better.

| Benchmark  | `acc_length` (num_draft_tokens=4) | `acc_length` (num_draft_tokens=8) |
| :--------- | :-------------------------------: | :-------------------------------: |
| humaneval  |               2.12                |               2.30                |
| math500    |               2.11                |               2.27                |
| ceval      |               1.86                |               1.97                |
| cmmlu      |               1.84                |               1.97                |
| gsm8k      |               1.83                |               1.88                |
| mtbench    |               1.79                |               1.86                |
| **Average**|             **~1.93**             |             **~2.04**             |

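Acceptance is highest on humaneval and math500 (above 2.1 with 4 draft tokens) and lowest on mtbench (1.79), and every benchmark improves when the number of draft tokens is raised from 4 to 8. The draft model therefore provides acceleration across coding, math, and general conversation tasks.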
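The **Average** row can be reproduced from the per-benchmark values. The short check below copies the numbers from the table and uses decimal arithmetic with half-up rounding, since the 4-token mean is exactly 1.925; the dictionary and function names are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP

# acc_length per benchmark: (num_draft_tokens=4, num_draft_tokens=8)
ACC_LENGTH = {
    "humaneval": ("2.12", "2.30"),
    "math500": ("2.11", "2.27"),
    "ceval": ("1.86", "1.97"),
    "cmmlu": ("1.84", "1.97"),
    "gsm8k": ("1.83", "1.88"),
    "mtbench": ("1.79", "1.86"),
}


def column_average(column: int) -> Decimal:
    """Unweighted mean of one table column, rounded half-up to two decimals."""
    values = [Decimal(row[column]) for row in ACC_LENGTH.values()]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(column_average(0), column_average(1))  # 1.93 2.04
```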

## Training Details

- **Training Framework:** This model was trained using **[SpecForge](https://github.com/sgl-project/SpecForge)**, an open-source framework for speculative decoding research.
- **Training Data:** The model was trained on the **EagleChat** dataset, which is available on [Hugging Face](https://huggingface.co/datasets/zhaode/EagleChat) and [ModelScope](https://modelscope.cn/datasets/zhaode/EagleChat).
- **Training Duration:** The model was trained for 2 epochs on 4x H20 GPUs. Training took 27 hours, for a total of 108 H20 GPU-hours.

</div>

---

<!-- 中文版 README -->
<div id="-zh-cn">

# 适用于 Qwen3-VL-2B-Instruct 的 EAGLE-3 草稿模型

## 模型简介

本仓库包含一个 **EAGLE-3 风格的草稿模型**，专为加速 `Qwen3-VL-2B-Instruct` 大语言模型的推理而训练。

请注意：这是一个**非独立模型**。它必须与对应的基座模型 (`Qwen3-VL-2B-Instruct`) 在推测解码 (speculative decoding) 框架下配合使用，才能实现显著的文本生成加速效果。

- **基座模型:** `Qwen3-VL-2B-Instruct`
- **模型架构:** EAGLE-3 (推测解码草稿模型)
- **核心优势:** 在不牺牲基座模型生成质量的前提下，将文本生成吞吐量提升 1.5 到 2.5 倍。

## 什么是 EAGLE？

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) 是一种推测解码方法。它利用一个轻量的草稿模型以较低开销生成一系列草稿词元 (draft tokens)，然后由更大的基座模型通过单次前向传播进行验证。每接受一个草稿词元，基座模型的一次前向传播就能多推进一步，从而实现速度提升。

本模型在此过程中扮演“草稿模型”的角色。它在下表所列评测基准上的平均接受长度 (`acc_length`) 约为 **1.93 个词元** (在草稿长度为 4 时)，这意味着在每次验证中，它平均能帮助基座模型推进接近 2 个词元。

## 性能表现

本模型在一系列多样化的评测基准上进行了评估。`acc_length` (平均接受的草稿词元数) 反映了加速的效率，数值越高越好。

| 评测基准 (Benchmark) | `acc_length` (num_draft_tokens=4) | `acc_length` (num_draft_tokens=8) |
| :------------------ | :-------------------------------: | :-------------------------------: |
| humaneval           |               2.12                |               2.30                |
| math500             |               2.11                |               2.27                |
| ceval               |               1.86                |               1.97                |
| cmmlu               |               1.84                |               1.97                |
| gsm8k               |               1.83                |               1.88                |
| mtbench             |               1.79                |               1.86                |
| **平均值**           |             **~1.93**             |             **~2.04**             |


这些结果表明，该模型在编码、数学和通用对话等不同任务上都能提供稳定且高效的加速效果。

## 训练细节

- **训练框架:** 本模型使用开源推测解码研究框架 **[SpecForge](https://github.com/sgl-project/SpecForge)** 进行训练。
- **训练数据:** 训练数据使用了 **EagleChat** 数据集。您可以在 [Hugging Face](https://huggingface.co/datasets/zhaode/EagleChat) 或 [ModelScope](https://modelscope.cn/datasets/zhaode/EagleChat) 上获取该数据集。
- **训练耗时:** 使用 4x H20 GPU 训练 2 轮，耗时 27 小时，共计 108 H20 卡时。
</div>