TY - JOUR
T1 - Failure Diagnosis for Distributed Systems Using Targeted Fault Injection
AU - Pham, Cuong
AU - Wang, Long
AU - Tak, Byung Chul
AU - Baset, Salman
AU - Tang, Chunqiang
AU - Kalbarczyk, Zbigniew
AU - Iyer, Ravishankar K.
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2017/2/1
Y1 - 2017/2/1
N2 - This paper introduces a novel approach to automating failure diagnostics in distributed systems by combining fault injection and data analytics. We use fault injection to populate the database of failures for a target distributed system. When a failure is reported from production environment, the database is queried to find 'matched' failures generated by fault injections. Relying on the assumption that similar faults generate similar failures, we use information from the matched failures as hints to locate the actual root cause of the reported failures. In order to implement this approach, we introduce techniques for (i) reconstructing end-to-end execution flows of distributed software components, (ii) computing the similarity of the reconstructed flows, and (iii) performing precise fault injection at pre-specified executing points in distributed systems. We have evaluated our approach using an OpenStack cloud platform, a popular cloud infrastructure management system. Our experimental results showed that this approach is effective in determining the root causes, e.g., fault types and affected components, for 71-100 percent of tested failures. Furthermore, it can provide fault locations close to actual ones and can easily be used to find and fix actual root causes. We have also validated this technique by localizing real bugs that occurred in OpenStack.
AB - This paper introduces a novel approach to automating failure diagnostics in distributed systems by combining fault injection and data analytics. We use fault injection to populate the database of failures for a target distributed system. When a failure is reported from production environment, the database is queried to find 'matched' failures generated by fault injections. Relying on the assumption that similar faults generate similar failures, we use information from the matched failures as hints to locate the actual root cause of the reported failures. In order to implement this approach, we introduce techniques for (i) reconstructing end-to-end execution flows of distributed software components, (ii) computing the similarity of the reconstructed flows, and (iii) performing precise fault injection at pre-specified executing points in distributed systems. We have evaluated our approach using an OpenStack cloud platform, a popular cloud infrastructure management system. Our experimental results showed that this approach is effective in determining the root causes, e.g., fault types and affected components, for 71-100 percent of tested failures. Furthermore, it can provide fault locations close to actual ones and can easily be used to find and fix actual root causes. We have also validated this technique by localizing real bugs that occurred in OpenStack.
KW - distributed system
KW - failure diagnosis
KW - Fault injection
KW - fault localization
KW - processing flow
UR - http://www.scopus.com/inward/record.url?scp=85010000259&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2016.2575829
DO - 10.1109/TPDS.2016.2575829
M3 - Article
AN - SCOPUS:85010000259
SN - 1045-9219
VL - 28
SP - 503
EP - 516
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 2
M1 - 7484300
ER -