TY - GEN
T1 - Log-based Abnormal Task Detection and Root Cause Analysis for Spark
AU - Lu, Siyang
AU - Rao, Bing Bing
AU - Wei, Xiang
AU - Tak, Byungchul
AU - Wang, Long
AU - Wang, Liqiang
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/9/7
Y1 - 2017/9/7
N2 - Application delays caused by abnormal tasks are common problems in big data computing frameworks. An abnormal task in Spark, which may run slowly without error or warning logs, not only reduces its resident node's performance, but also affects other nodes' efficiency. Spark log files report neither root causes of abnormal tasks, nor where and when abnormal scenarios happen. Although Spark provides a 'speculation' mechanism to detect straggler tasks, it can only detect tailed stragglers in each stage. Since the root causes of abnormalities are complicated, there are no effective ways to detect root causes. This paper proposes an approach to detect abnormality and analyze root causes using Spark log files. Unlike common online monitoring or analysis tools, our approach is a pure off-line method that can analyze abnormality accurately. Our approach consists of four steps. First, a parser preprocesses raw log files to generate structured log data. Second, in each stage of a Spark application, we choose features related to execution time and data locality of each task, as well as memory usage and garbage collection of each node. Third, based on the selected features, we detect where and when abnormalities happen. Finally, we analyze the problems using weighted factors to decide the probability of root causes. In this paper, we consider four potential root causes of abnormalities, which include CPU, memory, network, and disk. The proposed method has been tested on real-world Spark benchmarks. To simulate various scenarios of root causes, we conducted interference injections related to CPU, memory, network, and disk. Our experimental results show that the proposed approach is accurate in detecting abnormal tasks as well as finding the root causes.
AB - Application delays caused by abnormal tasks are common problems in big data computing frameworks. An abnormal task in Spark, which may run slowly without error or warning logs, not only reduces its resident node's performance, but also affects other nodes' efficiency. Spark log files report neither root causes of abnormal tasks, nor where and when abnormal scenarios happen. Although Spark provides a 'speculation' mechanism to detect straggler tasks, it can only detect tailed stragglers in each stage. Since the root causes of abnormalities are complicated, there are no effective ways to detect root causes. This paper proposes an approach to detect abnormality and analyze root causes using Spark log files. Unlike common online monitoring or analysis tools, our approach is a pure off-line method that can analyze abnormality accurately. Our approach consists of four steps. First, a parser preprocesses raw log files to generate structured log data. Second, in each stage of a Spark application, we choose features related to execution time and data locality of each task, as well as memory usage and garbage collection of each node. Third, based on the selected features, we detect where and when abnormalities happen. Finally, we analyze the problems using weighted factors to decide the probability of root causes. In this paper, we consider four potential root causes of abnormalities, which include CPU, memory, network, and disk. The proposed method has been tested on real-world Spark benchmarks. To simulate various scenarios of root causes, we conducted interference injections related to CPU, memory, network, and disk. Our experimental results show that the proposed approach is accurate in detecting abnormal tasks as well as finding the root causes.
KW - Log Analysis
KW - Abnormal Task
KW - Root
UR - http://www.scopus.com/inward/record.url?scp=85032347370&partnerID=8YFLogxK
U2 - 10.1109/ICWS.2017.135
DO - 10.1109/ICWS.2017.135
M3 - Conference contribution
AN - SCOPUS:85032347370
T3 - Proceedings - 2017 IEEE 24th International Conference on Web Services, ICWS 2017
SP - 389
EP - 396
BT - Proceedings - 2017 IEEE 24th International Conference on Web Services, ICWS 2017
A2 - Chen, Shiping
A2 - Altintas, Ilkay
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 24th IEEE International Conference on Web Services, ICWS 2017
Y2 - 25 June 2017 through 30 June 2017
ER -