---
license: apache-2.0
tags:
- diffusion
- vision-language
- qwen2.5-vl
pipeline_tag: image-text-to-text
library_name: transformers
---
# DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
**_SOTA dVLM Performance with <5% Data & 2.0× Inference Speedup!_**
[Lunbin Zeng](https://github.com/xiazhi1)<sup>1,\*</sup>, [Jingfeng Yao](https://github.com/JingfengYao)<sup>1,\*</sup>, [Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>, [Hongyuan Tao](https://github.com/Hongyuan-Tao)<sup>1</sup>, [Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>

<sup>1</sup>Huazhong University of Science and Technology

\*equal contribution, ✉️ corresponding author, xgwang@hust.edu.cn
[arXiv](https://arxiv.org/abs/2512.15713) | [Hugging Face Papers](https://huggingface.co/papers/2512.15713)
## 📰 News
- **[2025.12.25]** 🎄 We have completed our release plan ahead of schedule. **DiffusionVL is now fully open-sourced.** Merry Christmas to the community!
- **[2025.12.18]** 🎉 Our paper **DiffusionVL** is released on arXiv! We also release the DiffusionVL models translated from Qwen2.5-VL on Hugging Face.
## 🚀 Release Plan
- [x] Release paper
- [x] Release DiffusionVL model weights (translated from AR-VLMs)
- [x] Release DiffusionVL model weights (translated from AR-LMs)
- [x] Release evaluation code
- [x] Release training code
## 📄 Introduction
The diffusion paradigm has emerged as a promising alternative to autoregressive (AR) models, offering the potential for efficient parallel decoding. However, existing diffusion vision language models (dVLMs) largely lag behind mainstream autoregressive vision language models in performance, primarily due to the capability limitations of their base diffusion language models.
DiffusionVL bridges this gap by answering a fundamental question: ***Can we directly translate any existing autoregressive model into a powerful diffusion vision language model?*** We propose a diffusion finetuning framework that "translates" any pretrained AR model into a diffusion vision language model through a simple paradigm shift and modality shift. Unlike prior dVLMs, which are restricted to fixed generation lengths, DiffusionVL introduces a novel block decoding strategy that allows arbitrary-length generation and KV-cache reuse. With this integrated design, and despite training on less than 5% of the data required by previous methods, DiffusionVL translated from AR-VLMs achieves state-of-the-art performance among existing dVLMs and delivers a 2.0× inference speedup.
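To make the block decoding idea more concrete, here is a toy sketch in plain Python. It is not the DiffusionVL implementation: the names `block_decode`, `make_toy_denoiser`, `MASK`, and `EOS`, the confidence-based unmasking rule, and all parameter values are assumptions for illustration only. The sketch shows the general pattern the paragraph describes: each block is denoised in parallel over a few steps, then frozen and appended to the context, so generation can continue for as many blocks as needed. In a real model the frozen context would be served from a KV cache; here a plain list stands in for it.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical id for a position that is not decoded yet
EOS = 0    # hypothetical end-of-sequence id

# A denoiser maps (frozen prefix, current block) to one (token, confidence)
# proposal per block position. A real model would read the prefix from a
# KV cache instead of re-encoding it; this toy version just receives a list.
Denoiser = Callable[[List[int], List[int]], List[Tuple[int, float]]]


def block_decode(denoise: Denoiser, prompt: List[int], block_size: int = 4,
                 steps_per_block: int = 2, max_blocks: int = 8) -> List[int]:
    prefix = list(prompt)   # stands in for the cached context
    generated: List[int] = []
    per_step = -(-block_size // steps_per_block)  # ceiling division
    for _ in range(max_blocks):
        block = [MASK] * block_size
        while MASK in block:
            proposals = denoise(prefix, block)
            masked = [i for i, tok in enumerate(block) if tok == MASK]
            # Commit the most confident masked positions in parallel.
            masked.sort(key=lambda i: proposals[i][1], reverse=True)
            for i in masked[:per_step]:
                block[i] = proposals[i][0]
        # The finished block is frozen and becomes part of the context.
        prefix.extend(block)
        if EOS in block:
            generated.extend(block[:block.index(EOS)])
            break
        generated.extend(block)
    return generated


def make_toy_denoiser(target: List[int], prompt_len: int) -> Denoiser:
    """Toy stand-in for a model: proposes tokens from a fixed target."""
    def denoise(prefix: List[int], block: List[int]) -> List[Tuple[int, float]]:
        start = len(prefix) - prompt_len
        proposals = []
        for i in range(len(block)):
            pos = start + i
            token = target[pos] if pos < len(target) else EOS
            proposals.append((token, 1.0 / (1 + i)))  # earlier = more confident
        return proposals
    return denoise


prompt = [101, 102]
target = [5, 6, 7, 8, 9, 10, EOS]
result = block_decode(make_toy_denoiser(target, len(prompt)), prompt)
print(result)  # [5, 6, 7, 8, 9, 10]
```

Because the loop stops only when an end-of-sequence token appears (or `max_blocks` is reached), the output length is not fixed in advance, and earlier blocks are never recomputed once frozen.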
## ✨ Highlights
- **Universal Translation Framework:** Translate any AR model into a dVLM with a simple yet effective approach.
- **Superior Performance:** Achieve SOTA dVLM performance using <5% training data (738K vs 16.5M samples).
- **2.0× Faster Inference:** Block decoding strategy enables KV-cache reuse and 2.0× speedup over previous dVLMs.
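
As a quick sanity check on the "<5% training data" figure, the ratio of the two sample counts quoted in the highlights can be computed directly. The variable names are illustrative; the counts are the rounded values stated above (738K and 16.5M).

```python
diffusionvl_samples = 738_000    # 738K, as quoted in the highlights
prior_samples = 16_500_000       # 16.5M, as quoted in the highlights

fraction = diffusionvl_samples / prior_samples
print(f"{fraction:.2%}")  # prints "4.47%"
```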