Predictable, System-Level Fault Tolerance in a Component-based Real-time Operating System
As processor feature sizes shrink, mitigating transient faults, also known as Single-Event Upsets (SEUs), in low-level system services has become a critical aspect of dependable system design, especially in systems with timing constraints. The problem is more challenging when these faults occur in system-critical services, such as the scheduler and the memory manager. A Real-Time Operating System (RTOS) must continue to perform its specified tasks correctly despite the occurrence of system-level faults. The main goal of this research is to develop fault tolerance infrastructures in a component-based RTOS to protect against system-level transient faults. Based on different fault models, two infrastructures are developed: one to detect system-level faults and one to recover from them. We will demonstrate how the recovery infrastructure and its code can be automatically generated and deployed in a component-based RTOS to provide efficient, effective, and predictable system-level fault tolerance. We will also demonstrate how the fault detection infrastructure can be extended to probabilistic system-level anomaly detection to further enhance system dependability while still preserving system timing constraints (i.e., without missing deadlines).