Toward achieving operational excellence in a cloud

S. A. Baset, L. Wang, B. C. Tak, C. Pham, C. Tang

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

A cloud pools resources such as compute, network, and storage and delivers them quickly and automatically on demand through software. In addition, it provides automatic, policy-driven management of those resources through software. Such a system comprises many components whose states change rapidly. To manage it effectively, cloud service providers need to clearly understand the behavior of operations across components and be able to fix errors as early as possible. Building such capabilities (referred to as operational excellence) in a cloud system is challenging because components maintain internal state and interact in non-intuitive ways to perform automated operations. In this paper, we discuss the concept of operational excellence for a cloud system, the challenges in achieving it, and our vision. Toward that vision, we present a set of techniques to determine the causal sequences of system events across distributed components. We also model configured system states using causal sequences of system events, gather observed system states, and continuously verify the configured states against the observed states across system components. We apply these techniques to study OpenStack®, an open source infrastructure-as-a-service platform.

Original language: English
Article number: 6798708
Journal: IBM Journal of Research and Development
Volume: 58
Issue number: 2
State: Published - 2014
