Kavli Affiliate: Jing Wang
| First 5 Authors: Yucheng Xie
| Summary:
Diffusion models have advanced from text-to-image (T2I) to image-to-image
(I2I) generation by incorporating structured inputs such as depth maps,
enabling fine-grained spatial control. However, existing methods either train
separate models for each condition or rely on unified architectures with
entangled representations, resulting in poor generalization and high adaptation
costs for novel conditions. To this end, we propose DivControl, a decomposable
pretraining framework for unified controllable generation and efficient
adaptation. DivControl factorizes ControlNet via SVD into basic components
(pairs of singular vectors), which are disentangled into condition-agnostic
learngenes and condition-specific tailors through knowledge diversion during
multi-condition training.
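A minimal sketch of this decomposition idea follows, assuming a generic weight
matrix and an illustrative split of the singular-vector pairs into a shared
pool and a condition-specific pool; the function names, split rule, and scale
handling are assumptions for illustration, not the paper's implementation.

```python
import torch

def decompose_weight(W: torch.Tensor, num_shared: int):
    """Factorize W (out_features x in_features) into rank-1 singular-vector pairs."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # Each rank-1 component is s_i * u_i v_i^T; the singular value is folded into u_i.
    components = [(S[i] * U[:, i:i + 1], Vh[i:i + 1, :]) for i in range(S.numel())]
    learngene = components[:num_shared]    # condition-agnostic (shared) components
    tailor_pool = components[num_shared:]  # condition-specific components
    return learngene, tailor_pool

def reassemble(learngene, tailor_pool, routing_weights):
    """Recombine shared components with softly weighted condition-specific ones."""
    W = sum(u @ v for u, v in learngene)
    W = W + sum(w * (u @ v) for w, (u, v) in zip(routing_weights, tailor_pool))
    return W
```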
Knowledge diversion is implemented via a dynamic gate that performs soft
routing over tailors based on the semantics of condition instructions,
enabling zero-shot generalization and parameter-efficient adaptation to novel
conditions.
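A hedged sketch of such a gate is given below, assuming the condition
instruction has already been embedded by some text encoder; the class name,
dimensions, and softmax temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DynamicGate(nn.Module):
    """Maps a condition-instruction embedding to soft routing weights over tailors."""
    def __init__(self, instr_dim: int, num_tailors: int, temperature: float = 1.0):
        super().__init__()
        self.router = nn.Linear(instr_dim, num_tailors)
        self.temperature = temperature

    def forward(self, instr_emb: torch.Tensor) -> torch.Tensor:
        # Soft routing yields a distribution over tailors rather than a hard pick,
        # so an unseen condition can reuse a mixture of existing tailors.
        logits = self.router(instr_emb)
        return torch.softmax(logits / self.temperature, dim=-1)
```

Under this reading, a novel condition is handled simply by passing its
instruction embedding through the same gate to obtain mixture weights over the
existing tailors.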
To further improve condition fidelity and training efficiency, we introduce a
representation alignment loss that aligns condition embeddings with early
diffusion features.
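One plausible form of such an alignment objective is sketched below, assuming
a learned projection head and a cosine-similarity criterion; the exact loss
used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def alignment_loss(cond_emb: torch.Tensor,
                   early_diffusion_feat: torch.Tensor,
                   proj: torch.nn.Module) -> torch.Tensor:
    """Encourage projected condition embeddings to match early diffusion features."""
    z = F.normalize(proj(cond_emb), dim=-1)
    f = F.normalize(early_diffusion_feat, dim=-1)
    return 1.0 - (z * f).sum(dim=-1).mean()  # mean (1 - cosine similarity)
```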
Extensive experiments show that DivControl achieves state-of-the-art
controllability with 36.4$\times$ lower training cost, while simultaneously
improving average performance on basic conditions. It also delivers strong
zero-shot and few-shot performance on unseen conditions, demonstrating superior
scalability, modularity, and transferability.
| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3