---
frameworks:
- Pytorch
tasks:
- text-to-image-synthesis
base_model_relation: finetune
base_model:
- Qwen/Qwen-Image
---

# Qwen-Image Full Distillation Accelerated Model

## Model Introduction

This model is a distilled and accelerated version of [Qwen-Image](https://www.modelscope.cn/models/Qwen/Qwen-Image).
The original model requires 40 inference steps and uses classifier-free guidance (CFG), for a total of 80 forward passes.
The distilled model needs only 15 inference steps and no CFG, for just 15 forward passes: **about a 5× speed-up**.
The number of inference steps can be reduced further if needed, although generation quality may degrade.

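The arithmetic behind the speed-up can be sketched in a few lines; the helper below is illustrative, not part of DiffSynth-Studio:

```python
def forward_passes(num_steps: int, uses_cfg: bool) -> int:
    # Each denoising step costs one forward pass; classifier-free
    # guidance doubles it (one conditional + one unconditional pass).
    return num_steps * (2 if uses_cfg else 1)

original = forward_passes(40, uses_cfg=True)    # 40 steps with CFG
distilled = forward_passes(15, uses_cfg=False)  # 15 steps, no CFG
print(original, distilled, round(original / distilled, 1))  # prints: 80 15 5.3
```
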
The training framework is built on [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio).
The training dataset consists of 16,000 images generated by the original model from prompts randomly sampled from [DiffusionDB](https://www.modelscope.cn/datasets/AI-ModelScope/diffusiondb).
Training took about 1 day on 8 × MI308X GPUs.

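The data-generation recipe above can be sketched roughly as follows; the prompt pool, helper name, and loop are illustrative stand-ins (the actual run sampled 16,000 DiffusionDB prompts and rendered each one with the original model):

```python
import random

def sample_prompts(prompt_pool, n, seed=0):
    # Randomly sample n prompts (with replacement) from a prompt pool,
    # standing in for the DiffusionDB sampling described above.
    rng = random.Random(seed)
    return [rng.choice(prompt_pool) for _ in range(n)]

pool = ["a watercolor fox", "a neon city at night", "an oil-paint portrait"]
for i, prompt in enumerate(sample_prompts(pool, 4)):
    # In the real pipeline, each prompt would be rendered by the original
    # (teacher) model and saved as a training image for the student.
    print(f"{i:05d}.jpg <- {prompt}")
```
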
## Performance Comparison

| | Original Model (40 steps) | Original Model (15 steps) | Accelerated Model |
|---|---|---|---|
| Inference Steps | 40 | 15 | 15 |
| CFG Scale | 4 | 1 | 1 |
| Forward Passes | 80 | 15 | 15 |
| Example 1 | | | |
| Example 2 | | | |
| Example 3 | | | |

## Inference Code

```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```

```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
import torch

# Load the distilled DiT weights, plus the text encoder, VAE, and tokenizer
# from the original Qwen-Image repository.
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Distill-Full", origin_file_pattern="diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
)
prompt = "Delicate portrait, underwater girl, flowing blue dress, hair floating, clear light and shadows, bubbles surrounding, serene face, exquisite details, dreamy and beautiful."
# 15 steps and cfg_scale=1 (CFG disabled) are the distilled model's settings.
image = pipe(prompt, seed=0, num_inference_steps=15, cfg_scale=1)
image.save("image.jpg")
```

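As noted above, the step count can be pushed below 15 at some cost in quality; a sketch of such a sweep, reusing the `pipe` and `prompt` objects from the snippet above (the step values are arbitrary examples):

```python
# Assumes `pipe` and `prompt` from the previous snippet are in scope.
# Fewer steps mean fewer forward passes, hence faster generation,
# but image quality may degrade below the distilled 15-step setting.
for steps in (15, 10, 8):
    image = pipe(prompt, seed=0, num_inference_steps=steps, cfg_scale=1)
    image.save(f"image_{steps}steps.jpg")
```
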
---
license: apache-2.0
---