A Motion is Worth a Hybrid Sentence: Taming Language Model for Unified Motion Generation by Fine-grained Planning

Anonymous Author(s)
Submission Id: 2396

Abstract

Existing LLM-based motion models fail to fully leverage large models' reasoning capabilities for motion-related tasks, exhibiting poor generalization, limited text-motion alignment, and an inability to perform multimodal fused condition-driven motion generation. We argue that these issues arise from the modality gap and the highly coupled nature of motion tokens.

To address this, we propose the concept of hybrid motion sentences, which combine fine-grained motion descriptions with atomic body-part tokens to bridge the modality gap and resolve the alignment issues caused by the highly coupled motion tokens of previous methods. To obtain a large corpus of hybrid motion sentences, we introduce a novel motion-to-text generation method that combines motion operators with GPT-4V, yielding 68.2 million fine-grained textual descriptions across diverse modalities. To reconstruct high-quality motion from hybrid sentences and further promote semantic alignment, we propose Semantic-Aware Decoupled Motion Tokenization. Finally, we propose MotionUTG, built on LLaMA, which leverages hybrid motion sentences for both pretraining and instruction tuning.
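To make the idea of a hybrid motion sentence concrete, the following is a minimal Python sketch that interleaves per-body-part textual descriptions with atomic body-part tokens drawn from decoupled per-part codebooks. The body-part partition, the `<part_index>` token format, and the helper names (`BodyPartSegment`, `build_hybrid_sentence`) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical body-part partition; the paper's exact decoupling scheme is not specified here.
BODY_PARTS = ["left_arm", "right_arm", "left_leg", "right_leg", "torso"]


@dataclass
class BodyPartSegment:
    part: str              # body-part name
    description: str       # fine-grained textual description of this part's movement
    token_ids: List[int]   # atomic motion-token indices from this part's codebook


def to_atomic_tokens(part: str, token_ids: List[int]) -> str:
    """Render per-part codebook indices as atomic body-part tokens, e.g. <left_arm_127>."""
    return " ".join(f"<{part}_{idx}>" for idx in token_ids)


def build_hybrid_sentence(segments: List[BodyPartSegment]) -> str:
    """Interleave fine-grained descriptions with atomic body-part tokens,
    so a language model sees text and motion in a single token sequence."""
    pieces = [
        f"[{seg.part}] {seg.description} {to_atomic_tokens(seg.part, seg.token_ids)}"
        for seg in segments
    ]
    return " ".join(pieces)


if __name__ == "__main__":
    example = [
        BodyPartSegment("left_arm", "raises slowly above the head", [12, 84, 84, 3]),
        BodyPartSegment("torso", "leans slightly forward", [7, 7, 19]),
    ]
    print(build_hybrid_sentence(example))
```

In this reading, the fine-grained text carries semantics the language model can reason over, while the atomic per-part tokens carry the quantized motion needed for reconstruction; how the two are actually ordered and tokenized is left to the paper.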

Our method achieves strong fine-grained text-motion alignment and impressive zero-shot motion generation, and it is the first to support multimodal fused condition-driven motion generation tasks.


Comparison Videos with Other SOTA Methods


1 Comparison for Text-to-Motion

1.1 Fine-grained control capability


1.2 Zero-shot generation capability

2 Comparison for Text&Speech-to-Gesture