Detecting Failures and Locating Faults in Global Scale Online Services Using Bayesian Networks
The availability of large-scale online services (also known as cloud services) is becoming more important as more private and public infrastructure systems take dependencies on these services. The time taken to detect that a failure has occurred and the time taken to determine where in the system the fault lies are two important factors in the length of downtime experienced by a service. Determining where (or what) has failed in a complex system also drives the correct mitigation of the fault. Quickly and correctly identifying a fault therefore reduces the downtime experienced by the system and improves its availability. The purpose of this research was to compare the performance of a system (named Nova) that used ad hoc Bayesian networks to detect failures and locate faults in a complex, global-scale software-as-a-service cloud offering with that of the existing detection and scoping system (named Senex) used by this service. The comparison was done to determine whether on-demand, ad hoc Bayesian networks were faster than Senex at detecting failures and locating faults while maintaining reasonably accurate predictions of the fault scope. Actual service failure data were used for this study. The time that the Senex system took to detect the failure and scope the fault was recorded, and the actual failing component was identified and recorded. A simulation was then conducted using the Nova system. The Nova system was passed actual probe data from the time of the failures, and the time it took to detect the failure and identify the failing component was recorded along with the identity of the predicted failed components. It was found that the Nova system was significantly faster at detecting failures and identifying faults and that the Nova system was acceptably accurate in its predictions.
Further study is needed on creating more complex networks that can identify a larger set of issues than the Nova system was able to detect. Other areas of interest include identifying the owner of an issue using Bayesian networks and incorporating cost into the analysis.
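
The abstract does not describe the structure of the Nova networks, so the following is only an illustrative sketch of the general idea of locating a fault from probe data with Bayes' rule, not the method used in the study. It assumes a simple two-layer model in which at most one component is faulty and each probe fails with a fixed probability depending on whether it exercises that component. The component names, probe names, priors, and the `p_detect` and `p_false_alarm` values are all hypothetical.

```python
def fault_posterior(priors, probe_deps, observations,
                    p_detect=0.95, p_false_alarm=0.02):
    """Posterior over single-fault hypotheses given probe outcomes.

    priors:       component -> prior probability that it is the faulty one
    probe_deps:   probe -> set of components the probe exercises
    observations: probe -> True if the probe failed, False if it passed
    The key None in the result stands for the "no fault" hypothesis.
    """
    hypotheses = dict(priors)
    hypotheses[None] = 1.0 - sum(priors.values())

    def likelihood(faulty):
        # Probes are treated as conditionally independent given the fault.
        value = 1.0
        for probe, failed in observations.items():
            exercised = faulty in probe_deps[probe]
            p_fail = p_detect if exercised else p_false_alarm
            value *= p_fail if failed else 1.0 - p_fail
        return value

    # Bayes' rule: posterior is prior times likelihood, normalized.
    joint = {h: p * likelihood(h) for h, p in hypotheses.items()}
    evidence = sum(joint.values())
    return {h: j / evidence for h, j in joint.items()}


# Hypothetical service with three components and three probes.
priors = {"frontend": 0.01, "auth": 0.01, "storage": 0.01}
probe_deps = {
    "login_probe": {"frontend", "auth"},
    "token_probe": {"auth"},
    "read_probe": {"frontend", "storage"},
}
observations = {"login_probe": True, "token_probe": True, "read_probe": False}

posterior = fault_posterior(priors, probe_deps, observations)
most_likely = max(posterior, key=posterior.get)
print(most_likely, round(posterior[most_likely], 3))
```

With these made-up numbers, the two failing probes that both exercise `auth`, together with the passing `read_probe`, make `auth` the most probable fault location. A real network for a global-scale service would need many more nodes and would have to allow multiple simultaneous faults, which this sketch deliberately omits.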