M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

Kavli Affiliate: Zhuo Li

| First 5 Authors: Mingshuang Luo, Ruibing Hou, Zhuo Li, Hong Chang, Zimo Liu

| Summary:

This paper presents M$^3$GPT, an advanced $\textbf{M}$ultimodal,
$\textbf{M}$ultitask framework for $\textbf{M}$otion comprehension and
generation. M$^3$GPT operates on three fundamental principles. The first
focuses on creating a unified representation space for various motion-relevant
modalities. We employ discrete vector quantization for multimodal control and
generation signals, such as text, music and motion/dance, enabling seamless
integration into a large language model (LLM) with a single vocabulary. The
second involves modeling motion generation directly in the raw motion space.
This strategy circumvents the information loss associated with discrete
tokenizers, yielding more detailed and comprehensive motion generation.
Third, M$^3$GPT learns to model the connections and synergies among various
motion-relevant tasks. Text, the most familiar and well-understood modality for
LLMs, is utilized as a bridge to establish connections between different motion
tasks, facilitating mutual reinforcement. To our knowledge, M$^3$GPT is the
first model capable of comprehending and generating motions based on multiple
signals. Extensive experiments highlight M$^3$GPT’s superior performance across
various motion-relevant tasks and its powerful zero-shot generalization
capabilities for extremely challenging tasks.
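
As an illustrative sketch of the first principle, the snippet below shows how a VQ-style motion tokenizer could map raw motion features to discrete codebook indices and then shift those indices past the text and music token ranges, so all modalities share a single LLM vocabulary. This is not the authors' implementation: the class name `MotionVQTokenizer`, the helper `to_unified_vocab`, and every dimension and vocabulary size are assumed placeholders.

```python
# Illustrative sketch only (hypothetical names and sizes): quantize motion into
# discrete tokens that live in one vocabulary alongside text and music tokens.
import torch
import torch.nn as nn


class MotionVQTokenizer(nn.Module):
    """Encodes raw motion features and snaps them to a learned discrete codebook."""

    def __init__(self, motion_dim: int = 263, hidden_dim: int = 512, codebook_size: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(motion_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.codebook = nn.Embedding(codebook_size, hidden_dim)

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (batch, frames, motion_dim) -> code indices: (batch, frames)
        z = self.encoder(motion)                                        # (B, T, H)
        # Squared Euclidean distance of each frame embedding to every codebook entry.
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, K)
        return dist.argmin(dim=-1)


def to_unified_vocab(motion_codes: torch.Tensor, text_vocab_size: int, music_vocab_size: int) -> torch.Tensor:
    """Shift motion code indices past the text and music token ranges so text,
    music, and motion tokens can be interleaved in a single LLM input sequence."""
    return motion_codes + text_vocab_size + music_vocab_size


if __name__ == "__main__":
    tokenizer = MotionVQTokenizer()
    motion = torch.randn(2, 64, 263)           # two clips, 64 frames, 263-D pose features
    codes = tokenizer(motion)                   # discrete motion tokens
    ids = to_unified_vocab(codes, text_vocab_size=32000, music_vocab_size=2048)
    print(ids.shape, int(ids.min()), int(ids.max()))
```

The vocabulary offsets here are arbitrary; in a real system the shifted indices would be registered as extra tokens in the LLM tokenizer, and a separate decoder would map generated motion tokens (or, per the paper's second principle, continuous features) back to motion.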

| Search Query: ArXiv Query: search_query=au:"Zhuo Li"&id_list=&start=0&max_results=3
