Survey on Multimodal Model-driven Embodied Intelligence Research

Abstract: With the rapid development of multimodal models and large language models, embodied intelligence driven by multimodal models has shown strong performance and generalization across a variety of tasks and has become a frontier topic in scientific research. First, this survey describes the system architecture of embodied intelligence, which centers on environmental perception and understanding and on task planning and execution. Second, at the environmental perception and understanding level, it summarizes and analyzes the vision models, language models, vision-language models, and multimodal large models that drive embodied intelligence; at the task planning and execution level, it reviews and analyzes the vision-language-action models and vision-language-navigation models that drive embodied intelligence. Finally, it summarizes the challenges facing multimodal model-driven embodied intelligence and outlines directions for future development.

     
