Character Mixing for Video Generation

Kavli Affiliate: Yi Zhou | First Author: Tingting Liao | Summary: Imagine Mr. Bean stepping into Tom and Jerry: can we generate videos where characters interact naturally across different worlds? We study inter-character interaction in text-to-video generation, where the key challenge is to preserve each character’s identity and behaviors while enabling […]



Multi-Modal Multi-Task Semantic Communication: A Distributed Information Bottleneck Perspective

Kavli Affiliate: Cheng Peng | First Author: Yujie Zhou | Summary: Semantic communication (SemCom) shifts the focus from data transmission to meaning delivery, enabling efficient and intelligent communication. Existing AI-based coding schemes for multi-modal multi-task SemCom often require transmitters with full-modal data to participate in all receivers’ tasks, which leads […]



Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

Kavli Affiliate: Cheng Peng | First Author: Kaisi Guan | Summary: This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions while ensuring both modalities are aligned with the text. Despite progress in joint audio-video training, two […]



VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Kavli Affiliate: Zheng Zhu | First Author: Angen Ye | Summary: Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning, instead emitting final actions without considering […]



EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory

Kavli Affiliate: Cheng Peng | First Author: Jiahao Wang | Summary: Humans possess a remarkable ability to mentally explore and replay 3D environments they have previously experienced. Inspired by this mental process, we present EvoWorld: a world model that bridges panoramic video generation with evolving 3D memory to enable spatially […]



Charge and Valley Hydrodynamics in the Quantum Hall Regime of Gapped Graphene

Kavli Affiliate: Mamoru Matsuo | First Author: Danyu Shu | Summary: We develop a unified viscous hydrodynamics for charge and valley transport in gapped graphene in the quantum Hall regime. We redefine Hall viscosity as a response to static electric-field gradients instead of strain, establishing a derivative hierarchy that fundamentally […]



Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

Kavli Affiliate: Zheng Zhu | First Author: Nanxiang Jiang | Summary: Recent advances in text-to-image (T2I) diffusion models have enabled impressive generative capabilities, but they also raise significant safety concerns due to the potential to produce harmful or undesirable content. While concept erasure has been explored as a mitigation strategy, […]



Triangle Splatting+: Differentiable Rendering with Opaque Triangles

Kavli Affiliate: Yi Zhou | First Author: Jan Held | Summary: Reconstructing 3D scenes and synthesizing novel views has seen rapid progress in recent years. Neural Radiance Fields demonstrated that continuous volumetric radiance fields can achieve high-quality image synthesis, but their long training and rendering times limit practicality. 3D Gaussian […]



EgoDemoGen: Novel Egocentric Demonstration Generation Enables Viewpoint-Robust Manipulation

Kavli Affiliate: Zheng Zhu | First Author: Yuan Xu | Summary: Imitation-learning-based policies perform well in robotic manipulation, but they often degrade under *egocentric viewpoint shifts* when trained from a single egocentric viewpoint. To address this issue, we present **EgoDemoGen**, a framework that generates *paired* novel egocentric demonstrations […]



EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer

Kavli Affiliate: Zheng Zhu | First Author: Zhehao Dong | Summary: Vision-language-action (VLA) models increasingly rely on diverse training data to achieve robust generalization. However, collecting large-scale real-world robot manipulation data across varied object appearances and environmental conditions remains prohibitively time-consuming and expensive. To overcome this bottleneck, we propose Embodied […]

