Abstract:
With the rapid development of multimodal models and large language models, embodied intelligence driven by multimodal models has demonstrated strong performance and generalization across a variety of tasks, becoming a frontier topic in scientific research. This paper first introduces the system architecture of embodied intelligence, which is centered on environmental perception and understanding and on task planning and execution. At the environmental perception and understanding level, it then summarizes and analyzes the visual models, language models, vision-language models, and multimodal large models that drive embodied intelligence. At the task planning and execution level, it reviews and analyzes the vision-language-action models and vision-language-navigation models that contribute to embodied intelligence. Finally, it discusses the challenges and future directions for the development of multimodal model-driven embodied intelligence.