王重熙 (Wang Chongxi), 章隆兵 (Zhang Longbing). Neural network parallel inference acceleration based on general-purpose graphics processing unit[J]. 高技术通讯 (High Technology Letters), 2025, 35(3): 250-261
Neural network parallel inference acceleration based on general-purpose graphics processing unit
|
DOI: 10.3772/j.issn.1002-0470.2025.03.003
Keywords: multi-workload parallel acceleration; neural network inference; general-purpose graphics processing unit (GPGPU)
Authors: 王重熙 (Wang Chongxi), State Key Laboratory of Processors (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190, and University of Chinese Academy of Sciences, Beijing 100049; 章隆兵 (Zhang Longbing)
|
Abstract:
General-purpose graphics processing units (GPGPUs) are the primary source of computing power for accelerating artificial intelligence (AI) workloads, and their memory bandwidth and peak performance have grown rapidly alongside AI models. During neural network inference, however, single-sample or small-batch inference cannot simultaneously saturate the distinct compute, storage, and memory-access resources of a GPGPU, leaving some resources idle. To address this issue, this paper proposes a parallel inference acceleration method for neural networks on GPGPUs: multiple neural networks are inferred concurrently on one GPGPU, and complementary network layers are executed simultaneously so that the various resource types are fully utilized. Two approaches were first evaluated, namely compute unified device architecture (CUDA) streams in PyTorch and direct calls to the CUDA basic linear algebra subprograms library (cuBLAS) and the CUDA deep neural network library (cuDNN) within CUDA streams; when their parallel speedups fell short of expectations, profiling results were used to identify the factors in the workload scheduling mechanism of NVIDIA GPGPUs that restrict multi-workload parallelism. Based on the dissected scheduling mechanism, a kernel design method better suited to multi-workload concurrency is proposed and used to implement the main neural network operators, achieving parallel inference acceleration on a real GPGPU platform. Experimental results on an RTX 3080 GPGPU show that the proposed method achieves an average acceleration of 1.94 times for parallel inference of mainstream neural networks, which is 45% higher than the average 1.34 times acceleration obtained by directly calling the cuBLAS and cuDNN libraries in CUDA streams. This not only verifies the feasibility of parallel neural network inference acceleration on GPGPUs but also points a way toward multi-workload parallel acceleration for other kinds of GPGPU workloads.
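For reference, the sketch below illustrates the first baseline the abstract describes: concurrent inference of two networks using CUDA streams in PyTorch. It is a minimal illustration under assumed models and input sizes (ResNet-50, VGG-16, single 224x224 samples); it is not the paper's code, and it does not reflect the paper's proposed kernel design method.

```python
# Minimal sketch of the CUDA-streams-in-PyTorch baseline: issue two
# networks' kernels on separate streams so they may overlap on the GPU.
# Model choices and input shapes are illustrative assumptions.
import torch
import torchvision.models as models

device = torch.device("cuda")
net_a = models.resnet50().eval().to(device)
net_b = models.vgg16().eval().to(device)

# Single-sample inputs: the small-batch case that underutilizes the GPU.
x_a = torch.randn(1, 3, 224, 224, device=device)
x_b = torch.randn(1, 3, 224, 224, device=device)
torch.cuda.synchronize()  # make sure inputs are ready before overlapping

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

with torch.no_grad():
    # Kernels on different streams are allowed to execute concurrently,
    # letting complementary layers share compute and memory bandwidth.
    with torch.cuda.stream(stream_a):
        y_a = net_a(x_a)
    with torch.cuda.stream(stream_b):
        y_b = net_b(x_b)
    torch.cuda.synchronize()  # wait for both streams to finish
```

Whether the two streams actually overlap on the device is exactly what the paper examines: as the abstract notes, the NVIDIA workload scheduler can limit such concurrency, which is what motivates the authors' multi-workload-friendly kernel design.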