Article Abstract
Wu Mingchuan* **, Liu Ying*, Li Limin*, Feng Xiaobing* **. An inter-CG collaborative OpenCL compilation method on the Sunway TaihuLight supercomputer[J]. High Technology Letters (Chinese edition), 2022, 32(9): 927-936
An inter-CG collaborative OpenCL compilation method on the Sunway TaihuLight supercomputer
  
DOI:10.3772/j.issn.1002-0470.2022.09.006
Keywords: OpenCL; homegrown many-core processor; heterogeneous system; synchronization; data dependency analysis
Funding:
Author affiliations
Wu Mingchuan* ** (*State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (**University of Chinese Academy of Sciences, Beijing 100049)
Liu Ying* (*State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (**University of Chinese Academy of Sciences, Beijing 100049)
Li Limin* (*State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (**University of Chinese Academy of Sciences, Beijing 100049)
Feng Xiaobing* ** (*State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (**University of Chinese Academy of Sciences, Beijing 100049)
Abstract (Chinese, translated):
      In recent years, the demand for high performance computing in scientific fields has grown steadily, and how to effectively exploit the computing power of new supercomputer architectures has become a research focus. China's independently developed Sunway TaihuLight supercomputer adopts the homegrown heterogeneous many-core processor SW26010, which contains four core groups (CGs) but provides no inter-CG synchronization mechanism. To improve its programmability, this paper proposes an inter-CG synchronization method for the Sunway TaihuLight and implements it in the SWCL OpenCL compiler. The method uses data dependency analysis across the OpenCL host and kernel code to identify the program points where synchronization operations are necessary, and performs low-overhead inter-CG communication through the memory intersection (cross segment) of the SW26010, so that an OpenCL kernel program can be deployed onto multiple core groups automatically, without the programmer using the message passing interface (MPI) for explicit synchronization control. Experiments on the Sunway TaihuLight with the OpenCL test cases from SPEC ACCEL 1.2 show that the speedup achieved by this method is significantly better than that of the traditional MPI implementation.
Abstract (English):
      In recent years, the demand for high performance computing has increased significantly across scientific domains, and how to effectively utilize the computing power of new supercomputing architectures has become a research focus. The Sunway TaihuLight supercomputer adopts the homegrown heterogeneous many-core processor SW26010. In order to efficiently use the computing power of the four core groups (CGs) on the SW26010 and reduce the difficulty of programming, an inter-CG synchronization generation method for the Sunway TaihuLight is proposed, and an inter-CG synchronization generator based on the SWCL OpenCL compiler is designed and implemented. The method applies data dependency analysis across the OpenCL host and kernel code to identify the necessary synchronization operations, and uses the memory intersection of the SW26010 for inter-CG communication, which reduces communication overhead and frees programmers from explicit synchronization control through the message passing interface (MPI). In this way, one OpenCL kernel program is automatically deployed onto multiple core groups. Experiments are carried out with the OpenCL test cases in SPEC ACCEL 1.2, and the results show that the acceleration achieved by this method is significantly better than that of the traditional MPI implementation.
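To make the kind of dependency the method reasons about concrete, below is a minimal, generic OpenCL 1.2 host program in C. It is a sketch under assumptions, not the paper's SWCL implementation: the kernel name scale, the buffer size, and the scaling factors are illustrative, and the comments only mark the program points where, following the abstract's description, data written by one kernel launch is consumed by a later launch or by the host, i.e. where inter-CG synchronization would have to be generated if the NDRange were partitioned across core groups.

/* Generic OpenCL 1.2 host sketch (illustrative only, not the paper's SWCL code).
 * Error handling is reduced to one check macro for brevity. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(err) do { if ((err) != CL_SUCCESS) { \
    fprintf(stderr, "OpenCL error %d at line %d\n", (int)(err), __LINE__); exit(1); } } while (0)

static const char *src =
    "__kernel void scale(__global float *x, float f) {\n"
    "    size_t i = get_global_id(0);\n"
    "    x[i] *= f;\n"
    "}\n";

int main(void) {
    enum { N = 1 << 20 };
    float *host = malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) host[i] = 1.0f;

    cl_int err;
    cl_platform_id plat;  CHECK(clGetPlatformIDs(1, &plat, NULL));
    cl_device_id dev;     CHECK(clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL));
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);          CHECK(err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);               CHECK(err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);      CHECK(err);
    CHECK(clBuildProgram(prog, 1, &dev, "", NULL, NULL));
    cl_kernel k = clCreateKernel(prog, "scale", &err);                          CHECK(err);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, N * sizeof(float), NULL, &err); CHECK(err);
    CHECK(clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, N * sizeof(float), host, 0, NULL, NULL));

    size_t gsz = N;
    float f1 = 2.0f, f2 = 0.5f;
    CHECK(clSetKernelArg(k, 0, sizeof(cl_mem), &buf));
    CHECK(clSetKernelArg(k, 1, sizeof(float), &f1));
    /* Launch 1 writes buf. If this NDRange is split across several core
     * groups, each CG produces only a slice of buf. */
    CHECK(clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL));

    CHECK(clSetKernelArg(k, 1, sizeof(float), &f2));
    /* Launch 2 reads what launch 1 wrote: a cross-launch data dependency.
     * With the work partitioned across CGs, all CGs' writes must become
     * visible before this launch, so this is a point where inter-CG
     * synchronization would have to be generated. */
    CHECK(clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL));

    /* Host read: another consumer of the kernels' output, hence another
     * mandatory synchronization point before data returns to the host. */
    CHECK(clEnqueueReadBuffer(q, buf, CL_TRUE, 0, N * sizeof(float), host, 0, NULL, NULL));
    printf("host[0] = %f (expected 1.0)\n", host[0]);

    clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx); free(host);
    return 0;
}

In a host-only OpenCL stack such dependencies are resolved by the in-order command queue; the point of the sketch is that once a single launch is spread over several core groups, the same read-after-write relations are exactly where the compiler must insert its own inter-CG synchronization.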