Log-based Abnormal Task Detection and Root Cause Analysis for Spark

Siyang Lu, Bing Bing Rao, Xiang Wei, Byungchul Tak, Long Wang, Liqiang Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

61 Scopus citations

Abstract

Application delays caused by abnormal tasks are common problems in big data computing frameworks. An abnormal task in Spark, which may run slowly without error or warning logs, not only reduces its resident node's performance, but also affects other nodes' efficiency. Spark log files report neither the root causes of abnormal tasks, nor where and when abnormal scenarios happen. Although Spark provides a 'speculation' mechanism to detect straggler tasks, it can only detect tailed stragglers in each stage. Since the root causes of abnormalities are complicated, there are no effective ways to detect them. This paper proposes an approach to detect abnormalities and analyze their root causes using Spark log files. Unlike common online monitoring or analysis tools, our approach is a pure off-line method that can analyze abnormality accurately. Our approach consists of four steps. First, a parser preprocesses raw log files to generate structured log data. Second, in each stage of a Spark application, we choose features related to the execution time and data locality of each task, as well as the memory usage and garbage collection of each node. Third, based on the selected features, we detect where and when abnormalities happen. Finally, we analyze the problems using weighted factors to decide the probability of root causes. In this paper, we consider four potential root causes of abnormalities: CPU, memory, network, and disk. The proposed method has been tested on real-world Spark benchmarks. To simulate various root-cause scenarios, we conducted interference injections related to CPU, memory, network, and disk. Our experimental results show that the proposed approach is accurate in detecting abnormal tasks as well as in finding the root causes.
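The detection and weighted-factor steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual features, thresholds, or weights, so everything here is an assumption made for illustration: the record fields (`stage`, `duration`), the median-based threshold `factor`, the feature names, and the numbers in `WEIGHTS` are all hypothetical and are not the authors' method.

```python
from statistics import median

# Hypothetical weights (not from the paper): how strongly each feature
# is assumed to point toward each of the four root causes.
WEIGHTS = {
    "cpu":     {"duration": 0.7, "gc": 0.1, "memory": 0.2},
    "memory":  {"gc": 0.6, "memory": 0.4},
    "network": {"locality": 0.8, "duration": 0.2},
    "disk":    {"duration": 0.5, "locality": 0.5},
}


def detect_abnormal(tasks, factor=1.5):
    """Flag tasks whose duration exceeds factor * the median of their stage.

    `tasks` is a list of dicts with at least "stage" and "duration" keys,
    standing in for the structured log data produced by a parser.
    """
    by_stage = {}
    for task in tasks:
        by_stage.setdefault(task["stage"], []).append(task)

    abnormal = []
    for stage_tasks in by_stage.values():
        stage_median = median(t["duration"] for t in stage_tasks)
        abnormal.extend(
            t for t in stage_tasks if t["duration"] > factor * stage_median
        )
    return abnormal


def root_cause_probabilities(features, weights=WEIGHTS):
    """Turn feature scores (assumed to lie in [0, 1]) into a distribution
    over root causes by weighted sum followed by normalization."""
    scores = {
        cause: sum(w * features.get(name, 0.0) for name, w in fw.items())
        for cause, fw in weights.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {cause: 0.0 for cause in scores}
    return {cause: score / total for cause, score in scores.items()}


if __name__ == "__main__":
    durations = [10, 11, 9, 10, 30]
    tasks = [{"stage": 0, "id": i, "duration": d} for i, d in enumerate(durations)]
    print(detect_abnormal(tasks))
    print(root_cause_probabilities({"gc": 0.9, "memory": 0.8, "duration": 0.3}))
```

Grouping by stage before thresholding mirrors the abstract's per-stage framing. A real implementation would need features derived from actual Spark logs and weights calibrated against injected interference.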

Original language: English
Title of host publication: Proceedings - 2017 IEEE 24th International Conference on Web Services, ICWS 2017
Editors: Shiping Chen, Ilkay Altintas
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 389-396
Number of pages: 8
ISBN (Electronic): 9781538607527
State: Published - 7 Sep 2017
Event: 24th IEEE International Conference on Web Services, ICWS 2017 - Honolulu, United States
Duration: 25 Jun 2017 → 30 Jun 2017

Publication series

Name: Proceedings - 2017 IEEE 24th International Conference on Web Services, ICWS 2017

Conference

Conference: 24th IEEE International Conference on Web Services, ICWS 2017
Country/Territory: United States
City: Honolulu
Period: 25/06/17 → 30/06/17

Keywords

  • Log Analysis; Abnormal Task; Root
