Software and hardware co-optimization design and implementation of depthwise convolution

Qi Hao*; Liu Shaoli**; Li Wei***

Article abstract

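Qi Hao*, Liu Shaoli**, Li Wei***. Software and hardware co-optimization design and implementation of depthwise convolution[J]. Chinese High Technology Letters (高技术通讯), 2022, 32(7): 696-707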

Software and hardware co-optimization design and implementation of depthwise convolution

DOI: 10.3772/j.issn.1002-0470.2022.07.004

Keywords: neural network; depthwise convolution; accelerator; software and hardware co-optimization; computing efficiency

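Author	Affiliation
Qi Hao*	(School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026) (Shanghai Cambricon Information Technology Co., Ltd., Shanghai 201306) (**State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)
Liu Shaoli**	(School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026) (Shanghai Cambricon Information Technology Co., Ltd., Shanghai 201306) (**State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)
Li Wei***	(School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026) (Shanghai Cambricon Information Technology Co., Ltd., Shanghai 201306) (**State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)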

Abstract:

Deep learning has been widely adopted in recent years. Because mobile devices are constrained by both computing power and power consumption, many lightweight networks have been proposed, such as Xception and the MobileNet series. In these lightweight networks, depthwise convolution layers account for 31% to 50% of all convolutional layers, so optimizing depthwise convolution is a problem worth studying. General-purpose central processing units (CPUs) and single instruction, multiple data (SIMD) processors with a fixed arithmetic unit length cannot efficiently process the depthwise convolutions of various sizes found in neural networks, and their performance is low. To address this problem, this paper proposes a combined software and hardware method for optimizing depthwise convolution: a hardware architecture that supports multiple weight transmission modes, together with software optimizations such as mode selection and data splitting, improves computing efficiency while reducing memory access. Experimental results show that a depthwise convolution accelerator implemented with this method achieves a speedup of up to 9.3 times over a general-purpose CPU and up to 29.3 times over a single-core SIMD processor with an arithmetic unit length of 64.
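
For readers unfamiliar with the operation being optimized, the following is a minimal reference sketch of depthwise convolution in plain Python. It is not the accelerator design or the software flow described in the paper; the function names, the [C][H][W] nested-list layout, and the absence of padding are assumptions made purely for illustration.

```python
def depthwise_conv2d(x, w, stride=1):
    """Reference depthwise convolution without padding (illustrative only).

    x: input as nested lists with shape [C][H][W]
    w: one filter per channel, shape [C][KH][KW]
    Returns the output as nested lists with shape [C][OH][OW].
    """
    channels = len(x)
    in_h, in_w = len(x[0]), len(x[0][0])
    k_h, k_w = len(w[0]), len(w[0][0])
    out_h = (in_h - k_h) // stride + 1
    out_w = (in_w - k_w) // stride + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(channels)]
    for c in range(channels):  # each channel is filtered only by its own kernel
        for i in range(out_h):
            for j in range(out_w):
                acc = 0
                for di in range(k_h):
                    for dj in range(k_w):
                        acc += x[c][i * stride + di][j * stride + dj] * w[c][di][dj]
                out[c][i][j] = acc
    return out


def mac_counts(channels, out_h, out_w, k_h, k_w):
    """Multiply-accumulate counts for a depthwise layer and for a standard
    convolution with the same number of input and output channels."""
    depthwise = channels * out_h * out_w * k_h * k_w
    standard = depthwise * channels  # every output channel sums over all inputs
    return depthwise, standard
```

Because there is no accumulation across channels, a depthwise layer performs far fewer multiply-accumulates per loaded input value and weight than a standard convolution of the same shape, which is one reason memory access tends to matter for this operation.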
