Article Abstract
Tan Wenting, Lyu Cunchi, Shi Xiao, Zhao Xiaofang. Cloudless-Training: a framework to improve efficiency of geo-distributed ML training based on serverless [J]. High Technology Letters (Chinese), 2024, 34(3): 219-232
Cloudless-Training: a framework to improve efficiency of geo-distributed ML training based on serverless
DOI: 10.3772/j.issn.1002-0470.2024.03.001
Keywords: geo-distributed machine learning (ML) training; cross-cloud ML training; distributed training framework; serverless; cross-cloud model synchronization
Author affiliations
Tan Wenting* ** (*Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (**University of Chinese Academy of Sciences, Beijing 100049) (***Nanjing Institute of InforSuperBahn, Nanjing 211135) (****Suzhou Institute of Intelligent Computing Technology, Chinese Academy of Sciences, Suzhou 215028)
Lyu Cunchi* **
Shi Xiao* ***
Zhao Xiaofang* ****
Abstract:
      Geo-distributed machine learning (ML) training pools cloud resources from multiple regions for cooperative training and can serve many emerging ML scenarios (e.g., large-model training, federated learning). Its efficiency, however, is limited by two challenges. First, elastic scheduling of multi-regional cloud resources is usually missing, which hurts the resource utilization and performance of training. Second, cross-region model synchronization requires frequent communication over the wide area network (WAN), whose low bandwidth and high fluctuation impose a large communication overhead. This paper proposes Cloudless-Training, a framework that makes geo-distributed ML training efficient in three ways. First, it is built on the serverless computing model and uses a two-layer architecture, consisting of a control plane and a training-execution plane, to support elastic scheduling and communication across cloud regions. Second, it provides an elastic scheduling strategy that deploys training workflows adaptively according to the heterogeneity of available cloud resources and the distribution of the training datasets. Third, it provides two efficient cross-cloud synchronization strategies: asynchronous stochastic gradient descent with gradient accumulation (ASGD-GA) and model averaging (MA) between cross-cloud parameter servers (PSs). Cloudless-Training is implemented with OpenFaaS and evaluated on Tencent Cloud. Experimental results show that it significantly improves the resource utilization (reducing training cost by 9.2%-24.0%) and synchronization efficiency (up to 1.7x training speedup over the baseline) of geo-distributed ML training while preserving model convergence accuracy.
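The elastic scheduling strategy is only summarized in the abstract; the following is a minimal sketch of the kind of locality- and resource-aware placement it implies, where each worker is assigned to a region that already holds its data shard (avoiding WAN data movement), falling back to the region with the most free GPUs. The Region fields, the greedy scoring rule, and the Tencent Cloud region names are illustrative assumptions, not the paper's actual policy.

```python
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    free_gpus: int
    shards: set  # dataset shards already stored in this region

def place_workers(regions, shard_ids):
    """Greedy placement: prefer the region that holds the shard,
    break ties by free GPU capacity."""
    plan = {}
    for shard in shard_ids:
        best = max(regions, key=lambda r: (shard in r.shards, r.free_gpus))
        plan[shard] = best.name
        best.free_gpus = max(0, best.free_gpus - 1)  # reserve one GPU
    return plan

regions = [
    Region("ap-guangzhou", free_gpus=2, shards={"s0", "s1"}),
    Region("eu-frankfurt", free_gpus=4, shards={"s2"}),
]
print(place_workers(regions, ["s0", "s1", "s2", "s3"]))
```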
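ASGD-GA, as described, reduces WAN traffic by having each worker accumulate gradients over several local mini-batches and push them to the parameter server only once per round, asynchronously. The sketch below illustrates that pattern on a toy linear-regression task; the ParameterServer class, the ACCUM_STEPS value, and the averaging of accumulated gradients are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

ACCUM_STEPS = 8   # local mini-batches between WAN pushes (assumed value)
LR = 0.1

class ParameterServer:
    """Toy PS holding one weight vector; apply() is the asynchronous
    update (no barrier across workers, as in ASGD)."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
    def apply(self, grad):
        self.w -= LR * grad
    def pull(self):
        return self.w.copy()

def worker_round(ps, X, y, rng):
    """One WAN round: pull weights once, accumulate ACCUM_STEPS gradients
    locally, push the averaged gradient once."""
    w = ps.pull()
    accum = np.zeros_like(w)
    for _ in range(ACCUM_STEPS):
        i = rng.integers(len(X))
        accum += 2 * (X[i] @ w - y[i]) * X[i]   # squared-loss gradient
    ps.apply(accum / ACCUM_STEPS)               # single push per round

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 3.0, 0.5])
y = X @ true_w
ps = ParameterServer(4)
for _ in range(400):
    worker_round(ps, X, y, rng)
print("learned weights:", np.round(ps.w, 2))
```

With ACCUM_STEPS = 8, each worker issues one high-latency push per eight local steps, which is exactly the communication saving the abstract attributes to gradient accumulation.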
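Model averaging between cross-cloud PSs is likewise only named in the abstract. A minimal sketch under assumed details: each region trains against its own PS on its local data shard, and every SYNC_EVERY rounds the regional models are replaced by their element-wise mean over the WAN. The uniform weighting, SYNC_EVERY value, and toy task are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

SYNC_EVERY = 10   # local rounds between cross-cloud averaging (assumed value)
LR = 0.05

class RegionalPS:
    """One PS per cloud region, trained only on that region's data shard."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
    def local_step(self, X, y, rng):
        i = rng.integers(len(X))
        self.w -= LR * 2 * (X[i] @ self.w - y[i]) * X[i]

def cross_cloud_average(ps_list):
    """MA step over the WAN: every PS adopts the element-wise mean model."""
    mean_w = np.mean([ps.w for ps in ps_list], axis=0)
    for ps in ps_list:
        ps.w = mean_w.copy()

rng = np.random.default_rng(1)
true_w = np.array([1.0, -2.0, 3.0, 0.5])
# two regions, each holding its own data shard
shards = [(X, X @ true_w) for X in (rng.normal(size=(128, 4)) for _ in range(2))]
regions = [RegionalPS(4) for _ in shards]

for rnd in range(500):
    for ps, (X, y) in zip(regions, shards):
        ps.local_step(X, y, rng)
    if (rnd + 1) % SYNC_EVERY == 0:
        cross_cloud_average(regions)
print("averaged model:", np.round(regions[0].w, 2))
```

The averaging interval trades WAN cost against model divergence: a larger SYNC_EVERY means fewer cross-cloud rounds but lets regional models drift further apart between synchronizations.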