Research on the Construction and Application of a High-quality Industry Dataset Life Cycle Model

Abstract: Against the backdrop of deep empowerment by artificial intelligence and large-model technologies, data-driven approaches have become the core paradigm for industry decision-making, and high-quality datasets, as a critical production factor, are essential to systematic management. To address the current lack of whole-lifecycle planning in industry dataset construction, which makes dataset quality difficult to guarantee, a methodology adapted from mature software lifecycle management is applied: industry datasets are treated as specialized products subject to full-process control across three stages (planning, development, and maintenance) comprising seven phases: feasibility analysis, requirements analysis, architecture design, dataset development, quality evaluation, model validation, and data operation and maintenance, with objectives, core tasks, and standardized deliverables defined for each stage. In addition, given the dynamic evolution of datasets and ever-changing industry demands, an agile Dataset Development Life Cycle (DDLC) model is established. Through short-cycle iterations, cross-role collaboration, and closed-loop optimization, the model enables continuous delivery and dynamic refinement of datasets. The model is validated through a case study on the construction of a high-quality process dataset at a rail transit equipment manufacturing enterprise. The results show that the model effectively standardizes the entire dataset construction process, significantly improves the performance of process-oriented large models on tasks such as process instruction generation and compliance verification, and provides a scientific and practical theory and methodology for constructing and managing high-quality industry datasets.
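As a reading aid only (not part of the article), the three stages and seven phases named in the abstract could be modeled as a minimal sketch; note that the phase-to-stage grouping below is an illustrative assumption, not taken from the paper:

```python
from enum import Enum


class Stage(Enum):
    """The three DDLC stages named in the abstract."""
    PLANNING = "planning"
    DEVELOPMENT = "development"
    MAINTENANCE = "maintenance"


# The seven phases named in the abstract, mapped to stages.
# The stage assignment of each phase is an assumption for illustration.
DDLC_PHASES = {
    "feasibility analysis": Stage.PLANNING,
    "requirements analysis": Stage.PLANNING,
    "architecture design": Stage.PLANNING,
    "dataset development": Stage.DEVELOPMENT,
    "quality evaluation": Stage.DEVELOPMENT,
    "model validation": Stage.DEVELOPMENT,
    "data operation and maintenance": Stage.MAINTENANCE,
}
```

Such a mapping makes it easy to iterate the phases of one stage in order during a short-cycle iteration.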

     

