FPGA-Based Floating-Point Separable Convolutional Neural Network Acceleration Method

Zhang Zhichao* ** *** ****; Wang Jian* ** ***; Zhang Longbing* ** ***; Xiao Junhua* ** ***

Article Abstract

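Zhang Zhichao, Wang Jian, Zhang Longbing, Xiao Junhua. FPGA based floating point separable convolutional neural network acceleration method [J]. Chinese High Technology Letters (高技术通讯), 2022, 32(5): 441-453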

FPGA based floating point separable convolutional neural network acceleration method

DOI: 10.3772/j.issn.1002-0470.2022.05.001

Keywords: depthwise separable convolution; field programmable gate array (FPGA); data stream scheduling; acceleration; image classification

Author	Affiliation markers
Zhang Zhichao	*, **, ***, ****
Wang Jian	*, **, ***
Zhang Longbing	*, **, ***
Xiao Junhua	*, **, ***

Affiliations:
* State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190
** Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190
*** University of Chinese Academy of Sciences, Beijing 100049
**** The 15th Research Institute of China Electronics Technology Group Corporation, Beijing 100083

Abstract:

To address the speed bottleneck and power constraints of separable convolutional neural networks in space-borne aircraft target type classification, this paper proposes a floating-point depthwise separable convolutional neural network acceleration method based on field programmable gate array (FPGA) data stream scheduling, and uses it to accelerate the general MobileNet image classification model. A depthwise separable convolution computation array built from a multiplier matrix and a forward adder tree resolves the line-rate throughput bottleneck of floating-point depthwise separable convolution. Experimental results show that FPGA-based target classification runs at 633 FPS with a power consumption of 22.226 W and a computing performance of 236.04 GFLOPS. The computing speed is 1.10 to 2.61 times that of a Titan Xp GPU, and the computing efficiency is 7.44 to 18.66 times that of a Titan Xp GPU. Among comparable FPGA-based floating-point convolution acceleration schemes, the proposed method achieves the best computing performance and energy efficiency. The method also provides image classification accuracy consistent with the original model, decouples the software/hardware co-development process, and lowers the barrier for application developers to use FPGA-accelerated computing.
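The abstract names two ingredients: depthwise separable convolution, and a compute array that forms products in a multiplier matrix and reduces them through an adder tree. The following is a minimal software sketch of that arithmetic only, not the authors' hardware design. All function names are hypothetical, and the sketch assumes stride 1, no padding, and no bias or activation.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # indexed as [channel][row][col]


def adder_tree(values: List[float]) -> float:
    """Sum values by pairwise reduction, mimicking the stages of an adder tree."""
    level = list(values)
    if not level:
        return 0.0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element passes through to the next stage
        level = nxt
    return level[0]


def depthwise_conv(x: Tensor3, kernels: Tensor3) -> Tensor3:
    """Depthwise stage: one k x k filter per input channel (stride 1, no padding)."""
    out = []
    for ch, kern in zip(x, kernels):
        k = len(kern)
        h, w = len(ch), len(ch[0])
        plane = []
        for r in range(h - k + 1):
            row = []
            for c in range(w - k + 1):
                # "Multiplier matrix": all k*k products are independent of each other.
                products = [ch[r + i][c + j] * kern[i][j]
                            for i in range(k) for j in range(k)]
                row.append(adder_tree(products))
            plane.append(row)
        out.append(plane)
    return out


def pointwise_conv(x: Tensor3, weights: List[List[float]]) -> Tensor3:
    """Pointwise stage: 1 x 1 convolution; weights[o][c] mixes input channel c into output o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[adder_tree([wo[c] * x[c][r][col] for c in range(len(x))])
              for col in range(w)]
             for r in range(h)]
            for wo in weights]


def depthwise_separable_conv(x: Tensor3, dw_kernels: Tensor3,
                             pw_weights: List[List[float]]) -> Tensor3:
    """Depthwise stage followed by pointwise stage."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)


if __name__ == "__main__":
    x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
         [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]
    dw = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
    pw = [[1.0, 1.0], [2.0, -1.0]]
    print(depthwise_separable_conv(x, dw, pw))
```

Splitting the convolution this way replaces one large multiply-accumulate over all channels and kernel positions with a per-channel spatial filter plus a per-pixel channel mix, which is what makes the products easy to lay out as a fixed-size array feeding a reduction tree. Note that a tree reduction adds floating-point values in a different order than a sequential loop, so results can differ in the last bits for general inputs.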
