TY - JOUR
T1 - Toward achieving operational excellence in a cloud
AU - Baset, S. A.
AU - Wang, L.
AU - Tak, B. C.
AU - Pham, C.
AU - Tang, C.
PY - 2014
Y1 - 2014
N2 - A cloud pools resources such as compute, network, and storage and delivers them quickly and automatically on-demand through software. In addition, it provides automatic and policy-driven management of resources through software. Such a system comprises many components, whose states change rapidly. To manage it effectively, cloud service providers need to clearly understand the behavior of operations across components, and be able to fix errors as early as possible. The task of building such capabilities (referred to as operational excellence) in a cloud system is challenging because components maintain internal state and interact in non-intuitive ways to perform automated operations. In this paper, we discuss the concept of operational excellence for a cloud system, discuss the challenges in achieving the operational excellence, and describe our vision. Toward our vision, we present a set of techniques to determine the causal sequences of system events across distributed components. We also model configured system states using causal sequences of system events, gather observed system states, and continuously verify the configured and observed states across system components. We apply these techniques to study OpenStack®, an open source infrastructure-as-a-service platform.
AB - A cloud pools resources such as compute, network, and storage and delivers them quickly and automatically on-demand through software. In addition, it provides automatic and policy-driven management of resources through software. Such a system comprises many components, whose states change rapidly. To manage it effectively, cloud service providers need to clearly understand the behavior of operations across components, and be able to fix errors as early as possible. The task of building such capabilities (referred to as operational excellence) in a cloud system is challenging because components maintain internal state and interact in non-intuitive ways to perform automated operations. In this paper, we discuss the concept of operational excellence for a cloud system, discuss the challenges in achieving the operational excellence, and describe our vision. Toward our vision, we present a set of techniques to determine the causal sequences of system events across distributed components. We also model configured system states using causal sequences of system events, gather observed system states, and continuously verify the configured and observed states across system components. We apply these techniques to study OpenStack®, an open source infrastructure-as-a-service platform.
UR - http://www.scopus.com/inward/record.url?scp=84900317612&partnerID=8YFLogxK
U2 - 10.1147/JRD.2014.2298927
DO - 10.1147/JRD.2014.2298927
M3 - Article
AN - SCOPUS:84900317612
SN - 0018-8646
VL - 58
JO - IBM Journal of Research and Development
JF - IBM Journal of Research and Development
IS - 2
M1 - 6798708
ER -