Kavli Affiliate: Jing Wang
| First 5 Authors: Jing Wang, Ao Ma, Jiasong Feng, Dawei Leng, Yuhui Yin
| Summary:
The global self-attention mechanism in diffusion transformers performs
redundant computation because visual information is spatially sparse and
repetitive, and the attention maps of tokens within the same spatial window
are highly similar. To address this redundancy, we propose the
Proxy-Tokenized Diffusion Transformer (PT-DiT), which employs sparse
representative token attention (where the number of representative tokens is
much smaller than the total number of tokens) to model global visual
information efficiently. Specifically, within each transformer block, we
average the tokens in each spatial-temporal window to obtain a single proxy
token that represents that region. Global semantics are captured through the
self-attention of these proxy tokens and then injected into all latent tokens
via cross-attention. Simultaneously, we introduce window attention and
shifted-window attention to compensate for the limited detail modeling of the
sparse attention mechanism. Building on the well-designed PT-DiT, we further develop
the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV
tasks. Experimental results show that PT-DiT achieves competitive performance
while reducing the computational complexity in both image and video generation
tasks (e.g., a 49% reduction compared to DiT and a 34% reduction compared to
PixArt-$\alpha$). Visual results and the source code of Qihoo-T2X are
available at https://360cvgroup.github.io/Qihoo-T2X/.
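As a rough illustration of the proxy-token mechanism described above, the PyTorch-style sketch below averages the tokens inside each spatial-temporal window into proxy tokens, applies self-attention among the proxies, and injects the result back into all latent tokens via cross-attention. The module name, window sizes, and tensor layout are assumptions made for illustration, not the released Qihoo-T2X implementation.

# Hedged sketch of proxy-token attention; names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class ProxyTokenAttention(nn.Module):
    def __init__(self, dim, num_heads, window=(2, 4, 4)):
        super().__init__()
        self.window = window  # (frames, height, width) of one spatial-temporal window
        self.proxy_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, grid):
        # x: (B, N, C) latent tokens laid out as an (F, H, W) grid with F * H * W == N
        B, N, C = x.shape
        F, H, W = grid
        wf, wh, ww = self.window
        # Average the tokens inside each window to form one proxy token per window.
        windows = x.view(B, F // wf, wf, H // wh, wh, W // ww, ww, C)
        proxies = windows.mean(dim=(2, 4, 6)).reshape(B, -1, C)  # (B, M, C), M << N
        # Capture global semantics via self-attention among the sparse proxy tokens.
        proxies, _ = self.proxy_self_attn(proxies, proxies, proxies)
        # Inject the global context into all latent tokens via cross-attention.
        out, _ = self.cross_attn(x, proxies, proxies)
        return out

Under these assumptions, the attention cost drops from O(N^2) for full self-attention to roughly O(M^2 + N*M) with M = N / (wf * wh * ww) proxy tokens, which is consistent in spirit with the complexity reductions reported in the summary.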
| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3