Reuse-based job merging execution optimization techniques

Zhang Jindong* **; Tan Guangming*

Article abstract

Zhang Jindong* **, Tan Guangming*. Reuse-based job merging execution optimization techniques[J]. Chinese High Technology Letters, 2025, 35(10): 1037-1050

Reuse-based job merging execution optimization techniques

DOI: 10.3772/j.issn.1002-0470.2025.10.001

Keywords: data reuse, computation reuse, common substructure, job merging, cost model

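Author	Affiliation
Zhang Jindong* **	(*High Performance Computing Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (**University of Chinese Academy of Sciences, Beijing 100049)
Tan Guangming*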

Abstract:

With the rise of big data analytics and cloud computing, large-scale job services running on distributed clusters often contain frequently repeated jobs. To reduce the job latency and memory footprint associated with data reuse and computation reuse in massive data processing applications, repeated jobs must be handled by efficiently searching for and reusing computation overlaps at different target levels, which remains difficult. To address this challenge, the paper proposes MergeLap, a reuse-based job merging execution system. MergeLap uses a job structure signature mechanism and a cost-model-based common substructure selection strategy to quickly identify and search for maximal common substructures (CS). Its substructure cache uses a chained cache structure that compresses and caches intermediate results, enabling fast indexing while lowering memory consumption. Experimental results show that the approach effectively reduces the execution time of batch jobs and improves memory usage efficiency. Compared with native SparkSQL, MergeLap reduces the running time of batch jobs across multiple workloads by up to 46.5% and cache usage by up to 60.7%.
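The abstract does not describe how the structure signatures or the cost model are computed. The following is only an illustrative sketch of the general idea, not MergeLap's implementation: each subtree of a query plan gets a Merkle-style hash, subtrees whose hash occurs in two or more jobs are treated as common substructures, and a toy benefit score stands in for a cost model. The names `PlanNode`, `signature`, `maximal_common_substructures`, and `reuse_benefit`, the use of SHA-256, and the operator-count cost are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanNode:
    """Hypothetical plan operator (not a Spark class)."""
    op: str               # operator name, e.g. "Scan", "Filter"
    arg: str = ""         # operator parameters, e.g. a predicate
    children: tuple = ()  # child PlanNode instances, order-sensitive


def signature(node, index, job_id):
    """Hash the subtree rooted at `node` and record it in `index`.

    The hash covers the operator, its argument, and the child hashes,
    so two structurally identical subtrees get the same signature.
    """
    child_sigs = [signature(c, index, job_id) for c in node.children]
    payload = "|".join([node.op, node.arg] + child_sigs)
    sig = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    entry = index.setdefault(sig, {"node": node, "jobs": set()})
    entry["jobs"].add(job_id)
    return sig


def size(node):
    """Operator count of a subtree; an assumed stand-in for compute cost."""
    return 1 + sum(size(c) for c in node.children)


def maximal_common_substructures(jobs):
    """Return shared subtrees not nested in a larger subtree shared by the same jobs."""
    index = {}
    for job_id, root in jobs.items():
        signature(root, index, job_id)
    shared = {s: e for s, e in index.items() if len(e["jobs"]) >= 2}
    covered = set()
    for entry in shared.values():
        for child in entry["node"].children:
            child_sig = signature(child, {}, None)  # throwaway index
            if child_sig in shared and shared[child_sig]["jobs"] == entry["jobs"]:
                covered.add(child_sig)
    return [e for s, e in shared.items() if s not in covered]


def reuse_benefit(entry, cache_cost=1):
    """Toy score: work saved by computing once, minus an assumed caching cost."""
    return (len(entry["jobs"]) - 1) * size(entry["node"]) - cache_cost
```

Calling `maximal_common_substructures` with a dict that maps job identifiers to plan roots returns the candidate entries, and ranking them with `reuse_benefit` mimics choosing which intermediate results are worth caching. Because child order is part of the hash, commutative operators such as joins would need their inputs normalized before hashing in any real system.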
