# Qwen2.5-7B-Instruct_EAGLE3_UltraChat

## Introduction
Qwen2.5-7B-Instruct_EAGLE3_UltraChat is an EAGLE-3 draft model trained on top of the open-source Qwen2.5-7B-Instruct model using the SpecForge framework. It can be used with the EAGLE-3 speculative decoding algorithm to speed up the inference of large language models during the decoding stage.

This model is an artifact of the paper *Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs*.
## Training Configuration
We adopted the default training hyperparameters in SpecForge and trained the EAGLE-3 draft model to match the target model's output until convergence.

This model checkpoint was obtained after five epochs of training (~260K training steps with a batch size of 4). We find that although further training improves training-time accuracy, it has a negligible impact on the end-to-end speedup of EAGLE-3.
- Dataset: the UltraChat-200K dataset (a minimal loading sketch follows this list).
- Training environment: 4 NVIDIA H100 GPUs with 80 GB VRAM each, using the DeepSpeed framework. Each training epoch took approximately 3.5 hours.
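For reference, the chat data can be pulled directly from the Hugging Face Hub. The snippet below is a minimal sketch, assuming the standard `HuggingFaceH4/ultrachat_200k` release and its `train_sft` split; it is not the SpecForge preprocessing pipeline.

```python
# Minimal sketch: load UltraChat-200K from the Hugging Face Hub.
# The dataset ID and split name are assumptions (the standard HuggingFaceH4
# release), not part of the SpecForge training code.
from datasets import load_dataset

ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
print(ultrachat[0]["messages"][:2])  # each row holds a list of chat messages
```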
## Model Inference Launch Commands

### vLLM v0.13.0, EAGLE-3 (single chain of draft tokens)

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --dtype auto -tp 2 --max_model_len 2048 \
    --gpu-memory-utilization 0.8 --port 30000 \
    --speculative_config '{"model": "ruipeterpan/Qwen2.5-7B-Instruct_EAGLE3_UltraChat", "draft_tensor_parallel_size": 1, "num_speculative_tokens": 5, "method": "eagle3"}'
```
### vLLM v0.13.0, vanilla decoding

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --dtype auto -tp 2 --max_model_len 2048 \
    --gpu-memory-utilization 0.8 --port 30000
```
### SGLang v0.5.6.post2, EAGLE-3 (tree of draft tokens)

```bash
python -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct \
    --tp 2 --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path ruipeterpan/Qwen2.5-7B-Instruct_EAGLE3_UltraChat \
    --speculative-num-steps 8 \
    --speculative-eagle-topk 10 \
    --speculative-num-draft-tokens 60 \
    --mem-fraction 0.8 \
    --cuda-graph-max-bs 2 --log-level warning --port 30000
```
### SGLang v0.5.6.post2, vanilla decoding

```bash
python -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct \
    --tp 2 --mem-fraction 0.8 --cuda-graph-max-bs 2 --log-level warning --port 30000
```
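Once a server is up (any of the configurations above), it exposes an OpenAI-compatible API on the chosen port. The snippet below is a minimal client sketch using the `openai` Python package; the prompt and generation settings are placeholders, not part of this model card's evaluation setup.

```python
# Minimal client sketch against the OpenAI-compatible endpoint served on port 30000.
# Works for both the vLLM and SGLang launch commands above; the prompt is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```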
## vLLM Performance Evaluation

We ran our evaluations on two NVIDIA A6000 (48 GB) GPUs connected via PCIe 4.0 x16 and conducted an extensive hyperparameter search over num_speculative_tokens from 3 to 20. Each entry reports the best speedup across these speculation lengths. The table below reports the TPT (time per token) speedup over vanilla decoding; a rough measurement sketch follows the table.
| Target Model | MATH | AIME | GSM8K | GPQA | HumanEval | Average |
|---|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 2.51x | 2.45x | 2.27x | 2.03x | 2.68x | 2.39x |
| Qwen2.5-14B-Instruct | 2.33x | 2.23x | 2.19x | 1.98x | 2.61x | 2.27x |
| Qwen2.5-7B-Instruct | 2.19x | 2.05x | 2.02x | 1.78x | 2.25x | 2.06x |
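As a rough illustration of how such a TPT speedup can be measured, the sketch below times a completion against the OpenAI-compatible endpoint and divides by the number of generated tokens; running it once against the EAGLE-3 server and once against the vanilla server and taking the ratio gives a speedup estimate. This is a hedged sketch of the general procedure, not the paper's evaluation harness.

```python
# Hedged sketch: estimate time per output token (TPT) against a running server.
# Not the paper's evaluation harness; the prompt, max_tokens, and speedup
# procedure (vanilla TPT / EAGLE-3 TPT) are illustrative assumptions.
# Note: the elapsed time includes prefill, so this is only a rough estimate.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def time_per_token(prompt: str, max_tokens: int = 512) -> float:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,
    )
    elapsed = time.perf_counter() - start
    return elapsed / response.usage.completion_tokens

# Example: speedup = TPT(vanilla server) / TPT(EAGLE-3 server) on the same prompt.
print(f"TPT: {time_per_token('What is the sum of the first 100 positive integers?'):.4f} s/token")
```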
## Relevant Links
- Paper: Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs
- GitHub Repository: ruipeterpan/failfast
- Base Model: Qwen2.5-7B-Instruct
## Citation

```bibtex
@article{pan2025failfast,
  title={Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs},
  author={Pan, Rui and Chen, Zhuofu and Liu, Hongyi and Krishnamurthy, Arvind and Netravali, Ravi},
  journal={arXiv preprint arXiv:2512.20573},
  year={2025}
}
```