xiazhi committed on
Commit 4879ca9 · verified · 1 Parent(s): 3dc8875

Update README.md

Files changed (1)
  1. README.md +75 -59
README.md CHANGED
@@ -3,66 +3,82 @@ license: apache-2.0
  tags:
  - diffusion
  - vision-language
- - qwen2.5
- - siglip
+ - qwen2.5-vl
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
 
- # DiffusionVL-Qwen2.5
-
- A DiffusionVL model that pairs a SigLIP vision encoder and a PoolerProjector with a Qwen2.5 LLM using BD3LM diffusion-based generation.
-
- ## Usage
-
- ```python
- from transformers import AutoModelForCausalLM, AutoProcessor
- import torch
- from PIL import Image
-
- # Load model
- model = AutoModelForCausalLM.from_pretrained(
-     "path/to/model",
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
-     trust_remote_code=True
- )
-
- # Load processor
- processor = AutoProcessor.from_pretrained("path/to/model", trust_remote_code=True)
-
- # Prepare inputs
- image = Image.open("image.jpg").convert("RGB")
- messages = [
-     {"role": "user", "content": [
-         {"type": "image"},
-         {"type": "text", "text": "Describe this image."}
-     ]}
- ]
- text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
- # Move tensors to the model device
- inputs = {k: v.to(model.device) if hasattr(v, 'to') else v for k, v in inputs.items()}
-
- # Generate (greedy diffusion decoding with low-confidence remasking)
- output_ids = model.generate(
-     inputs=inputs["input_ids"],
-     images=inputs.get("pixel_values"),
-     gen_length=256,
-     steps=8,
-     temperature=0.0,
-     remasking_strategy="low_confidence_static",
- )
-
- # Decode
- output_text = processor.decode(output_ids[0], skip_special_tokens=True)
- print(output_text)
- ```
 
- ## Model Configuration
-
- - **Architecture**: DiffusionVL_Qwen2_5_ForConditionalGeneration
- - **Vision Encoder**: SigLIP (384x384, patch_size=14)
- - **MM Projector**: PoolerProjector (Conv2d + MLP)
- - **LLM**: Qwen2.5 (standard RoPE)
- - **BD3LM Enabled**: True
- - **Block Size**: 8
- - **Hidden Size**: 3584
- - **Num Layers**: 28
+ <div align="center">
+
+ <h1>DiffusionVL: Translating Any Autoregressive Models into <br> Diffusion Vision Language Models</h1>
+
+ **_SOTA dVLM Performance with <5% Data & 2.0× Inference Speedup!_**
+
+ [Lunbin Zeng](https://github.com/xiazhi1)<sup>1,\*</sup>, [Jingfeng Yao](https://github.com/JingfengYao)<sup>1,\*</sup>, [Bencheng Liao](https://github.com/LegendBC)<sup>1</sup>, [Hongyuan Tao](https://github.com/Hongyuan-Tao)<sup>1</sup>, [Wenyu Liu](https://scholar.google.com/citations?user=D7jDk7gAAAAJ&hl=en)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1, ✉️</sup>
+
+ <sup>1</sup>Huazhong University of Science and Technology
+
+ <sup>*</sup>equal contribution, <sup>✉️</sup>corresponding author, xgwang@hust.edu.cn
+
+ [![arXiv](https://img.shields.io/badge/arXiv-DiffusionVL-b31b1b.svg)](https://arxiv.org/abs/2512.15713) [![Hugging Face Paper](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Paper-red)](https://huggingface.co/papers/2512.15713) <a href="https://github.com/hustvl/DiffusionVL"><img src="https://img.shields.io/badge/GitHub-Repository-black?logo=github" alt="GitHub"></a> <a href="https://huggingface.co/collections/hustvl/diffusionvl"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue" alt="Hugging Face"></a>
+
+ </div>
+
+ ## 📰 News
+
+ - **[2025.12.25]** 🎄 We have completed our release plan ahead of schedule. **DiffusionVL is now fully open-sourced.** Merry Christmas to the community!
+ - **[2025.12.18]** 🎉 Our paper **DiffusionVL** is released on arXiv! We also release the DiffusionVL models translated from Qwen2.5-VL on Hugging Face.
+
+ ## 🚀 Release Plan
+
+ - [x] Release paper
+ - [x] Release DiffusionVL model weights (translated from AR-VLMs)
+ - [x] Release DiffusionVL model weights (translated from AR-LMs)
+ - [x] Release evaluation code
+ - [x] Release training code
+
+ ## 📄 Introduction
+
+ The diffusion paradigm has emerged as a promising alternative to autoregressive (AR) models, offering the potential for efficient parallel decoding. However, existing diffusion vision language models (dVLMs) still lag far behind mainstream autoregressive vision language models in performance, primarily due to the capability limitations of their base diffusion language models.
 
+ DiffusionVL bridges this gap by answering a fundamental question: ***Can we directly translate any existing autoregressive models into powerful diffusion vision language models?*** We propose a diffusion finetuning framework that "translates" any pretrained AR model into a diffusion vision language model through a simple paradigm shift and modality shift. Unlike prior dVLMs restricted to fixed generation lengths, DiffusionVL introduces a novel block decoding strategy that allows arbitrary-length generation and KV-cache reuse. With this integrated design, and despite training on less than 5% of the data required by previous methods, DiffusionVL translated from AR-VLMs achieves state-of-the-art performance among existing dVLMs and delivers a 2.0× inference speedup.
+
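+ To make the block decoding idea concrete, here is a minimal, self-contained sketch of block diffusion decoding in general. It is illustrative only, not the released implementation: the mask id, block size, step count, and the confidence-based update rule are placeholder assumptions, and `denoiser` is a stub standing in for the actual model.
+
+ ```python
+ import torch
+
+ VOCAB, BLOCK, STEPS = 32, 8, 4  # illustrative constants
+ MASK_ID = VOCAB                 # mask id outside the 0..VOCAB-1 prediction range
+
+ def denoiser(seq):
+     # Stand-in for the dVLM: per-position logits over the vocabulary.
+     # The real model would also consume image features and reuse the
+     # KV cache of the already-finalized prefix.
+     return torch.randn(seq.shape[0], VOCAB)
+
+ def block_decode(prompt, num_blocks):
+     seq = prompt.clone()
+     for _ in range(num_blocks):  # arbitrary length: keep appending blocks
+         seq = torch.cat([seq, torch.full((BLOCK,), MASK_ID)])
+         for step in range(STEPS):  # parallel denoising within the block
+             logits = denoiser(seq)[-BLOCK:]
+             conf, pred = logits.softmax(-1).max(-1)
+             masked = seq[-BLOCK:] == MASK_ID
+             if not masked.any():
+                 break
+             if step == STEPS - 1:
+                 idx = masked.nonzero(as_tuple=True)[0]  # commit the rest
+             else:
+                 # low-confidence remasking: commit only the most confident
+                 # still-masked positions this step
+                 k = max(1, int(masked.sum()) // (STEPS - step))
+                 conf[~masked] = -1.0
+                 idx = conf.topk(k).indices
+             seq[seq.numel() - BLOCK + idx] = pred[idx]
+         # a finalized block behaves like an AR prefix, so its keys/values
+         # can be cached and reused, which is where the speedup comes from
+     return seq
+
+ print(block_decode(torch.tensor([5, 7, 9]), num_blocks=2))
+ ```
+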
+ ## ✨ Highlights
+
+ - **Universal Translation Framework:** Translate any AR model into a dVLM with a simple yet effective approach.
+
+ - **Superior Performance:** Achieve SOTA dVLM performance using <5% of the training data (738K vs. 16.5M samples).
+
+ - **2.0× Faster Inference:** The block decoding strategy enables KV-cache reuse and a 2.0× speedup over previous dVLMs.
+
+ <div align="center">
+ <img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/benchmark.png" alt="Benchmark" width="800">
+ <img src="https://github.com/hustvl/DiffusionVL/raw/main/assets/framework.png" alt="Framework" width="800">
+ </div>
+
+ ## 🚀 Get Started
+
+ | Document | Description |
+ | :--- | :--- |
+ | [Installation](https://github.com/hustvl/DiffusionVL/raw/main/docs/INSTALLATION.md) | Environment setup, data and model preparation |
+ | [Training & Evaluation](https://github.com/hustvl/DiffusionVL/raw/main/docs/TRAINING_EVALUATION.md) | Train and evaluate DiffusionVL models |
+ | [Inference](https://github.com/hustvl/DiffusionVL/raw/main/docs/INFERENCE.md) | Quick inference with pre-trained models; a minimal sketch follows below |
+
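+ For quick reference, a minimal inference sketch follows, adapted from the usage example removed above. The checkpoint path is a placeholder, and the `generate` keyword arguments (`gen_length`, `steps`, `remasking_strategy`) come from the model's custom `trust_remote_code` API as shown in the old README, so consult the Inference doc for the authoritative signature.
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ # Placeholder path; point this at a released DiffusionVL checkpoint.
+ path = "path/to/model"
+ model = AutoModelForCausalLM.from_pretrained(
+     path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
+ )
+ processor = AutoProcessor.from_pretrained(path, trust_remote_code=True)
+
+ messages = [{"role": "user", "content": [
+     {"type": "image"},
+     {"type": "text", "text": "Describe this image."},
+ ]}]
+ text = processor.tokenizer.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=[Image.open("image.jpg").convert("RGB")],
+                    return_tensors="pt", padding=True).to(model.device)
+
+ # Custom diffusion-generation kwargs, as in the previous usage example.
+ output_ids = model.generate(
+     inputs=inputs["input_ids"],
+     images=inputs.get("pixel_values"),
+     gen_length=256,
+     steps=8,
+     temperature=0.0,
+     remasking_strategy="low_confidence_static",
+ )
+ print(processor.decode(output_ids[0], skip_special_tokens=True))
+ ```
+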
+ ## ❤️ Acknowledgements
+
+ This repo is mainly built on [Qwen2.5-VL](https://github.com/QwenLM/Qwen3-VL), [LLaDA-V](https://github.com/ML-GSAI/LLaDA-V), [BD3LMs](https://github.com/kuleshov-group/bd3lms), [SDAR](https://github.com/JetAstra/SDAR), and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval). We thank the authors for their open-source contributions.
+
+ ## 📝 Citation
+
+ If you find our work useful, please cite our paper:
+
+ ```
+ @misc{zeng2025diffusionvltranslatingautoregressivemodels,
+       title={DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models},
+       author={Lunbin Zeng and Jingfeng Yao and Bencheng Liao and Hongyuan Tao and Wenyu Liu and Xinggang Wang},
+       year={2025},
+       eprint={2512.15713},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2512.15713},
+ }
+ ```