T-GATE
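T-GATE accelerates inference for Stable Diffusion, PixArt, and Latent Consistency Model pipelines by skipping the cross-attention calculation once it has converged. The method requires no additional training and can speed up inference by 10-50%. T-GATE is also compatible with other optimization methods such as DeepCache.
Before you begin, make sure T-GATE is installed.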
pip install tgate
pip install -U torch diffusers transformers accelerate DeepCache
To use T-GATE with a pipeline, you need to use its corresponding loader.
| Pipeline | T-GATE Loader |
|---|---|
| PixArt | TgatePixArtLoader |
| Stable Diffusion XL | TgateSDXLLoader |
| Stable Diffusion XL + DeepCache | TgateSDXLDeepCacheLoader |
| Stable Diffusion | TgateSDLoader |
| Stable Diffusion + DeepCache | TgateSDDeepCacheLoader |
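The table above can be kept as a small lookup when a script has to choose a loader at runtime. The following is a minimal sketch: `TGATE_LOADERS` and `loader_name_for` are hypothetical helper names, not part of the tgate package, and only the loader class names come from the table.

```python
# Hypothetical helper: maps the pipeline families from the table above
# to the name of the matching T-GATE loader class.
TGATE_LOADERS = {
    "PixArt": "TgatePixArtLoader",
    "Stable Diffusion XL": "TgateSDXLLoader",
    "Stable Diffusion XL + DeepCache": "TgateSDXLDeepCacheLoader",
    "Stable Diffusion": "TgateSDLoader",
    "Stable Diffusion + DeepCache": "TgateSDDeepCacheLoader",
}


def loader_name_for(pipeline_family: str) -> str:
    """Return the loader class name for a pipeline family listed in the table."""
    try:
        return TGATE_LOADERS[pipeline_family]
    except KeyError:
        known = ", ".join(sorted(TGATE_LOADERS))
        raise ValueError(
            f"No T-GATE loader listed for {pipeline_family!r}; known families: {known}"
        ) from None


print(loader_name_for("Stable Diffusion XL"))
```

With tgate installed, the class itself could then be resolved with `getattr(tgate, loader_name_for(...))`.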
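Next, create a TgateLoader with the pipeline, the gate step (the time step at which to stop calculating the cross-attention), and the number of inference steps. Then call the tgate method on the pipeline with a prompt, the gate step, and the number of inference steps.
Let's see how to enable this for a few different pipelines.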
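To make the role of the gate step concrete, here is a toy sketch of the gating idea: cross-attention is computed on every step before the gate step, and its last output is reused afterwards. This is a simplified illustration in NumPy, not the tgate library's implementation; `cross_attention`, `denoise_with_gate`, and the toy update rule are all assumptions made for the example.

```python
import numpy as np


def cross_attention(latent, text_emb):
    """Toy stand-in for an expensive cross-attention call (softmax attention)."""
    scores = latent @ text_emb.T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ text_emb


def denoise_with_gate(latent, text_emb, num_inference_steps, gate_step):
    """Run a toy denoising loop that stops recomputing attention at gate_step."""
    if not 1 <= gate_step <= num_inference_steps:
        raise ValueError("gate_step must be between 1 and num_inference_steps")
    cached = None
    attention_calls = 0
    for step in range(num_inference_steps):
        if step < gate_step:
            # Before the gate step: compute cross-attention and keep the result.
            cached = cross_attention(latent, text_emb)
            attention_calls += 1
        # From the gate step on, the cached output is reused unchanged.
        latent = latent + 0.1 * (cached - latent)  # toy update, not a real scheduler
    return latent, attention_calls


rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(6, 8))
_, calls = denoise_with_gate(latent, text_emb, num_inference_steps=25, gate_step=8)
print(f"cross-attention computed on {calls} of 25 steps")
```

In this sketch a smaller gate step skips more attention calls, which is where the savings come from; the actual trade-off with image quality depends on the model.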
Accelerate PixArtAlphaPipeline with T-GATE:
import torch
from diffusers import PixArtAlphaPipeline
from tgate import TgatePixArtLoader

# Load the base pipeline in half precision.
pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)

gate_step = 8  # time step at which cross-attention stops being computed
inference_step = 25  # total number of inference steps

# Wrap the pipeline with the matching T-GATE loader and move it to the GPU.
pipe = TgatePixArtLoader(
    pipe,
    gate_step=gate_step,
    num_inference_steps=inference_step,
).to("cuda")

image = pipe.tgate(
    "An alpaca made of colorful building blocks, cyberpunk.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]

T-GATE also supports StableDiffusionPipeline and PixArt-alpha/PixArt-LCM-XL-2-1024-MS.
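The same pattern should carry over to the other loaders in the table. The sketch below applies it to Stable Diffusion XL with TgateSDXLLoader, assuming that loader takes the same arguments as the PixArt one. The function name `generate_sdxl_with_tgate`, the checkpoint, and the default gate step of 10 are illustrative choices rather than values from this page. Imports are placed inside the function so that defining it does not require a GPU or the model weights.

```python
def generate_sdxl_with_tgate(prompt, gate_step=10, inference_step=25):
    """Sketch: wrap an SDXL pipeline with T-GATE and generate one image.

    Assumes TgateSDXLLoader follows the same call pattern as TgatePixArtLoader.
    """
    import torch
    from diffusers import StableDiffusionXLPipeline
    from tgate import TgateSDXLLoader

    # Load the base SDXL pipeline in half precision.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
    )

    # Wrap it with the SDXL loader from the table and move it to the GPU.
    pipe = TgateSDXLLoader(
        pipe,
        gate_step=gate_step,
        num_inference_steps=inference_step,
    ).to("cuda")

    # Generate with the same gate step and step count passed to the loader.
    return pipe.tgate(
        prompt,
        gate_step=gate_step,
        num_inference_steps=inference_step,
    ).images[0]
```

On a machine with a CUDA GPU and tgate installed, it could be called as `image = generate_sdxl_with_tgate("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.")`.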
Benchmarks
| Model | MACs | Params | Latency | Zero-shot 10K-FID on MS-COCO |
|---|---|---|---|---|
| SD-1.5 | 16.938T | 859.520M | 7.032s | 23.927 |
| SD-1.5 w/ T-GATE | 9.875T | 815.557M | 4.313s | 20.789 |
| SD-2.1 | 38.041T | 865.785M | 16.121s | 22.609 |
| SD-2.1 w/ T-GATE | 22.208T | 815.433M | 9.878s | 19.940 |
| SD-XL | 149.438T | 2.570B | 53.187s | 24.628 |
| SD-XL w/ T-GATE | 84.438T | 2.024B | 27.932s | 22.738 |
| Pixart-Alpha | 107.031T | 611.350M | 61.502s | 38.669 |
| Pixart-Alpha w/ T-GATE | 65.318T | 462.585M | 37.867s | 35.825 |
| DeepCache (SD-XL) | 57.888T | - | 19.931s | 23.755 |
| DeepCache w/ T-GATE | 43.868T | - | 14.666s | 23.999 |
| LCM (SD-XL) | 11.955T | 2.570B | 3.805s | 25.044 |
| LCM w/ T-GATE | 11.171T | 2.024B | 3.533s | 25.028 |
| LCM (Pixart-Alpha) | 8.563T | 611.350M | 4.733s | 36.086 |
| LCM w/ T-GATE | 7.623T | 462.585M | 4.543s | 37.048 |
Latency is measured on an NVIDIA 1080TI, MACs and Params are calculated with calflops, and FID is calculated with PytorchFID.
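The relative savings implied by the benchmark table can be derived directly from the latency column. The short script below is a sketch of that arithmetic. The latency values are copied from the table, while `LATENCIES` and `latency_reduction` are names introduced only for this example.

```python
# (baseline latency in s, latency with T-GATE in s), copied from the table above.
LATENCIES = {
    "SD-1.5": (7.032, 4.313),
    "SD-2.1": (16.121, 9.878),
    "SD-XL": (53.187, 27.932),
    "Pixart-Alpha": (61.502, 37.867),
    "DeepCache (SD-XL)": (19.931, 14.666),
    "LCM (SD-XL)": (3.805, 3.533),
    "LCM (Pixart-Alpha)": (4.733, 4.543),
}


def latency_reduction(baseline: float, gated: float) -> float:
    """Fraction of the baseline latency saved by enabling T-GATE."""
    return 1.0 - gated / baseline


reductions = {name: latency_reduction(*pair) for name, pair in LATENCIES.items()}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1%} lower latency with T-GATE")
```

On these figures, the non-LCM rows come out at roughly 26-47% lower latency. The LCM rows, which already run in only a few steps, save only a few percent.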