TY - GEN
T1 - MapReduce scheduler to minimize the size of intermediate data in shuffle phase
AU - Jeyaraj, Rathinaraja
AU - Ananthanarayana, V. S.
AU - Paul, Anand
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/6
Y1 - 2019/6
N2 - Hadoop MapReduce is one of the cost-effective ways to process huge data in this decade. Although it is open source, setting up Hadoop on-premise is not affordable for small-scale businesses and research entities. Therefore, consuming Hadoop MapReduce as a service from the cloud is increasingly common, as it is scalable on demand and based on a pay-per-use model. In such a multi-tenant environment, virtual bandwidth is an expensive commodity, and co-located virtual machines compete with each other for the bandwidth. A study shows that 26%-70% of MapReduce job latency is due to the shuffle phase in the MapReduce execution sequence. The primary expectation of a typical cloud user is to minimize the service usage cost. Allocating less bandwidth to the service costs less but increases job latency and consequently increases makespan. This trade-off is addressed by minimizing the amount of intermediate data generated in the shuffle phase at the application level. To achieve this, we propose a Time Sharing MapReduce Job Scheduler to minimize the amount of intermediate data; thus, service cost is reduced. As a by-product, MapReduce job latency and makespan are also improved. Results show that our proposed model reduced the size of intermediate data by up to 62.1% compared to classical schedulers with combiners.
AB - Hadoop MapReduce is one of the cost-effective ways to process huge data in this decade. Although it is open source, setting up Hadoop on-premise is not affordable for small-scale businesses and research entities. Therefore, consuming Hadoop MapReduce as a service from the cloud is increasingly common, as it is scalable on demand and based on a pay-per-use model. In such a multi-tenant environment, virtual bandwidth is an expensive commodity, and co-located virtual machines compete with each other for the bandwidth. A study shows that 26%-70% of MapReduce job latency is due to the shuffle phase in the MapReduce execution sequence. The primary expectation of a typical cloud user is to minimize the service usage cost. Allocating less bandwidth to the service costs less but increases job latency and consequently increases makespan. This trade-off is addressed by minimizing the amount of intermediate data generated in the shuffle phase at the application level. To achieve this, we propose a Time Sharing MapReduce Job Scheduler to minimize the amount of intermediate data; thus, service cost is reduced. As a by-product, MapReduce job latency and makespan are also improved. Results show that our proposed model reduced the size of intermediate data by up to 62.1% compared to classical schedulers with combiners.
KW - MapReduce scheduler
KW - Shuffle phase
UR - https://www.scopus.com/pages/publications/85078032803
U2 - 10.1109/ICIS46139.2019.8940354
DO - 10.1109/ICIS46139.2019.8940354
M3 - Conference contribution
AN - SCOPUS:85078032803
T3 - Proceedings - 18th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2019
SP - 30
EP - 34
BT - Proceedings - 18th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2019
A2 - Xu, Simon
A2 - Wang, Yongbin
A2 - Shi, Mingyong
A2 - Shang, Wenqian
A2 - Liu, Jiefeng
A2 - Zhang, Kailong
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 18th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2019
Y2 - 17 June 2019 through 19 June 2019
ER -