Project Abstract
Scientists conduct analyses that rely on large-scale simulations to achieve breakthroughs in multiple scientific domains, such as climate, energy, quantum physics, and more. As system complexity increases, future large-scale systems and the data generated, processed, stored, and transmitted by them are subject to increasingly higher occurrences of soft errors or silent data corruption. Importantly, this silently compromised data may go undetected because current High-Performance Computing (HPC) software stacks largely lack mechanisms to inform scientists of silent data corruption that could adversely affect the integrity of their scientific interpretation. In order to combat silent data corruption in HPC systems, this project introduces highly efficient and cost-effective mechanisms to monitor and detect soft errors. Through the use of unsupervised error detection, this project increases scientists’ confidence in extreme-scale scientific simulations and data analyses, which advance the data-intensive science discovery needed to solve some of the world’s most complex contemporary problems, such as predicting severe weather conditions, designing new materials, making new energy sources pragmatic, and others. The methodologies of this project are also applicable to general-purpose computing systems, increasing security and reliability on traditional computing and Internet of Things devices.

This research applies compressive sensing and machine learning, especially an unsupervised approach, to accurately detect soft and hardware errors in current and future HPC systems. A compact representation that corresponds to the original dataset is efficiently obtained through compressive sensing coupled with a hardware-assisted data collection mechanism that requires no changes to existing infrastructure. This is used with a spatiotemporal anomaly detection model for in situ characterization of soft errors and errors caused by a hardware malfunction, detecting anomalies deviating from acceptable ranges. The approach is built into the scientific workflow and operates seamlessly with the application without requiring application modification or customization. Validation of the mechanism across multiple HPC platforms using scientific workflows allows scientists to analyze and verify their datasets with increased levels of trust.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project Outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)