🤔Availability Strategies: Prevent Faults from Becoming Failures.
👷♂️ Software Architecture Series — Part 10.
📍Software should be available and ready to perform its designated task whenever required. If a fault occurs, the system should be able to mask or repair it so that it does not become a failure. Okay! But the terminology is a bit confusing. Faults, failures: how exactly are they different?
💨Let’s clear the air of confusion:
🚩A failure is a deviation of the system from its specified behavior that is visible to an external observer. System failures are the end results of faults. A fault can be either internal or external to the system under consideration; it is anything that can disrupt the system’s expected behavior. Hardware faults such as a disk failure or memory corruption can lead to system crashes. Bugs or defects in software code may cause programs to crash, produce incorrect results, or behave unexpectedly. There can also be human-induced faults, as well as environmental faults such as power outages.
🔍Detecting, diagnosing, and mitigating faults is crucial to ensure system reliability and minimize downtime. When a fault occurs, it doesn’t necessarily lead to a system failure immediately. Instead, faults might remain latent until triggered by specific conditions, leading to failures. So, a good software system should be designed in such a way that faults can be prevented, tolerated, removed, or forecast. In short, a system should be ‘fault-resilient’.
⚒Designing a highly available, fault-tolerant system is challenging because it requires understanding the nature of the failures that can arise during operation. Once the failure possibilities are understood, mitigation strategies can be incorporated into the system. These strategies serve one of three purposes: fault detection, fault recovery, or fault prevention.
👁Fault Detection: To take action against a fault, the system first needs to detect or anticipate it. Yeah, so obvious! Let us discuss the ways we can detect or anticipate faults in a system.
· Monitoring: System monitoring keeps watch over the health of the various parts of the system, including processors, processes, I/O, memory, and network. A system monitor can use other strategies, such as checking for faulty timestamps or missed heartbeats, to anticipate faults.
· Ping: A ping is an asynchronous request/response message pair exchanged between nodes. It is used to measure round-trip delay and to determine whether the network path to a component is reachable. If the system monitor receives a response from a pinged component, the component is live. A heartbeat can be considered a special case of the ping strategy: heartbeat messages indicate that a system or component is alive and functioning, and they are typically sent at regular intervals so that the health of nodes in a network or distributed system can be monitored. In systems where scalability is a concern, minimizing the transport and processing overhead of these messages is crucial.
In traditional setups, systems exchange separate messages solely for heartbeat signals, which adds network traffic and processing overhead. One way to reduce this overhead is piggybacking: the heartbeat signal rides along with other control messages that are already being exchanged, so fewer messages are needed purely for health checks. However, there are trade-offs. Piggybacking heartbeats onto other control messages can make message processing more complex, because the receiver has to interpret and separate different kinds of information in one message. There is also the risk that if the carrying message fails or is delayed, the heartbeat is not transmitted or checked on time.
· Timestamps: In distributed systems, where multiple processes or nodes communicate by passing messages, establishing the chronological order of events is critical. Each process has its own local clock, which may not be perfectly synchronized with the clocks of other nodes because of network latency or clock drift. When an event occurs at a process, the process can assign it a timestamp based on its local clock. This timestamp records when the event occurred relative to that process’s local time, so an out-of-order or implausible timestamp can reveal an incorrect sequence of events.
Alternatively, sequence numbers can also be used to uniquely identify events or messages exchanged between nodes. Each event or message is assigned a sequence number that is strictly greater than the previous one. Sequence numbers are independent of local clocks and thus avoid the problems of clock synchronization and inconsistent timestamping.
Sometimes, systems use a combination of both timestamping and sequence numbers to enhance event ordering. They might use sequence numbers to establish the global order of events while using timestamps for additional context or resolution in cases where events are very close in sequence.
· Sanity Check: Sanity checks use knowledge of a component’s internal design or of the system state to verify that a specific operation or output is valid and reasonable.
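The detection strategies above can be combined in a small sketch. The example below is only an illustration under stated assumptions, not a reference implementation: the `Heartbeat` and `HeartbeatMonitor` names, the `timeout` parameter, and the message fields are all hypothetical. Each heartbeat carries a sequence number, and the monitor uses its own clock to flag nodes that have been silent for too long and uses the sequence numbers to notice lost or stale messages.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Heartbeat:
    node: str       # sender identifier (hypothetical field)
    seq: int        # sender's strictly increasing sequence number
    sent_at: float  # sender's local clock, kept for context only


class HeartbeatMonitor:
    """Tracks the latest heartbeat per node and reports suspected faults."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self._last_seq: dict[str, int] = {}
        self._last_seen: dict[str, float] = {}

    def receive(self, hb: Heartbeat, now: float) -> list[str]:
        """Record a heartbeat; return warnings about sequence anomalies."""
        warnings: list[str] = []
        prev = self._last_seq.get(hb.node)
        if prev is not None:
            if hb.seq <= prev:
                # Old or duplicated message: report it and do not record it.
                warnings.append(f"{hb.node}: stale or duplicate seq {hb.seq}")
                return warnings
            if hb.seq > prev + 1:
                missed = hb.seq - prev - 1
                warnings.append(f"{hb.node}: missed {missed} heartbeat(s)")
        self._last_seq[hb.node] = hb.seq
        # Liveness is judged on the monitor's clock, not the sender's.
        self._last_seen[hb.node] = now
        return warnings

    def suspected_down(self, now: float) -> list[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.receive(Heartbeat("a", 1, 0.0), now=0.0)
monitor.receive(Heartbeat("b", 1, 0.1), now=0.1)
print(monitor.receive(Heartbeat("a", 4, 2.9), now=3.0))
print(monitor.suspected_down(now=5.0))
```

Judging liveness on the monitor’s own clock sidesteps clock drift between nodes, while the sequence numbers reveal lost messages without relying on any clock at all.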
♻Recover from Faults: Once faults are detected, the next logical step is to recover from them and restore normal operation. Let us discuss some of the strategies that can be employed to recover from faults:
· Redundant spare: The system can have one or more duplicate components that step in and take over the work if the primary component fails.
· Rollback: The system can be rolled back to a previous working state when a failure is detected, so that normal operation can resume as quickly as possible.
· Exception Handling: Some programming environments provide exception classes that let a program catch unexpected errors and fall back to a defined, safe behavior while handling them.
· Software Upgrades: Faulty or deprecated functions and classes can be replaced with new code so that normal operation is restored without affecting services. Bug fixes are usually delivered this way, through regular software patch upgrades.
· Graceful degradation: The system can keep its most critical functions running while dropping less critical ones, so that the failure of an individual component does not escalate into a complete system failure.
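Several of these recovery strategies can work together. The following is a minimal sketch, assuming a hypothetical `call_with_failover` helper and made-up component functions: it tries the primary component, falls back to a redundant spare through exception handling, and finally degrades gracefully to a simpler default if both fail.

```python
from __future__ import annotations

from typing import Callable


def call_with_failover(
    primary: Callable[[], str],
    spare: Callable[[], str],
    degraded: str,
) -> tuple[str, str]:
    """Return (source, result): try primary, then spare, then a degraded default."""
    for name, component in (("primary", primary), ("spare", spare)):
        try:
            return name, component()
        except Exception:
            # Illustration only: real code should catch the narrowest
            # exception type it can and log the fault before moving on.
            continue
    return "degraded", degraded


def broken() -> str:
    # Hypothetical component that always fails.
    raise ConnectionError("recommendation service is down")


def healthy() -> str:
    # Hypothetical component that works normally.
    return "personalised recommendations"


print(call_with_failover(broken, healthy, degraded="static bestseller list"))
# ('spare', 'personalised recommendations')
print(call_with_failover(broken, broken, degraded="static bestseller list"))
# ('degraded', 'static bestseller list')
```

Returning the source alongside the result lets the caller, or a monitor, see that the system is running on a spare or in degraded mode even though the user still gets a response.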
🛡Prevent Faults: So far, we have discussed how to detect faults and recover from them. But a vigilant system should be able to prevent faults from occurring in the first place. Let us discuss some strategies that help us anticipate faults and take decisive action in real time:
· Service removal: Suppose that during a routine sanity check you find logs suggesting that a component is starting to misbehave. A good strategy is to take that component out of service, apply the relevant patches, and then restore it to service.
· Predictive model: The system continuously collects and analyses various operational metrics that reflect the state and behavior of the system. These metrics can include session establishment rates, resource utilization thresholds (high and low watermarks), process states (e.g., in-service, out-of-service), and message queue lengths. A predictive model, often based on machine learning algorithms or statistical analysis, is trained using historical data of these operational metrics. The model learns patterns, trends, or anomalies in the data that are indicative of impending faults or deviations from normal behavior.
The monitoring system continuously observes the real-time operational metrics and compares them against predefined thresholds or patterns established by the predictive model. These thresholds define the normal operational range within which the system is expected to perform optimally. When the observed metrics start deviating or approaching critical thresholds as predicted by the model, the monitoring system flags or alerts that corrective action might be necessary. This proactive alerting allows for pre-emptive measures to prevent faults or performance degradation. Upon receiving alerts or detecting potential issues, the system can automatically trigger corrective actions or notify operators to intervene and address the emerging problem. This might involve load balancing, resource allocation, system reconfiguration, or other proactive measures to prevent system failure or performance degradation.
For example: In an HTTP server, if the session establishment rate drops significantly, the system might predict an impending issue and take actions to optimize server performance or allocate additional resources.
· Exception Prevention: This strategy anticipates potential issues and puts mechanisms in place to avoid errors or handle them gracefully. Error-correcting codes (ECC), abstract data types such as smart pointers, bounds checking, and automatic resource management are common techniques in this category.
· Increase Competence Set: This strategy aims towards increasing a system component’s capability to handle more fault cases as part of normal operation. For example, we can develop the system component to manage resources effectively. This includes proper resource allocation, handling resource contention, and ensuring timely release of resources. Components that can adapt to changes in resource availability without halting their operation have a broader competence set (set of states in which it is “competent” to operate).
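The predictive model strategy described above can be illustrated with a deliberately simple stand-in for a trained model. This is a sketch under assumptions: the `WatermarkPredictor` class, its `window` and `horizon` parameters, and the sample queue lengths are all hypothetical, and a linear trend over a sliding window takes the place of the machine learning or statistical model the text mentions. It projects a metric a few steps ahead and raises an alert before the high watermark is actually reached.

```python
from __future__ import annotations

from collections import deque


class WatermarkPredictor:
    """Projects a metric's recent trend and warns before a high watermark."""

    def __init__(self, high_watermark: float, window: int = 5, horizon: int = 3) -> None:
        self.high_watermark = high_watermark
        self.horizon = horizon  # how many future samples to project ahead
        self.samples: deque[float] = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record a sample; return True if the projected value crosses the watermark."""
        self.samples.append(value)
        if len(self.samples) < 2:
            return False  # not enough data to estimate a trend
        # Average change per sample across the current window.
        slope = (self.samples[-1] - self.samples[0]) / (len(self.samples) - 1)
        projected = self.samples[-1] + slope * self.horizon
        return projected >= self.high_watermark


predictor = WatermarkPredictor(high_watermark=100)
queue_lengths = [10, 12, 15, 30, 50, 70]  # made-up message queue lengths
alerts = [predictor.observe(q) for q in queue_lengths]
print(alerts)  # [False, False, False, False, False, True]
```

In this run the alert fires at a queue length of 70, while the metric is still below the watermark of 100, which leaves time for corrective action such as load balancing or allocating more resources.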
🛒Summary: There is no one-size-fits-all solution for ensuring availability. However, good domain knowledge, together with an understanding of how the system components and the network architecture work internally, helps in choosing decisive or preventive measures so that the system is better prepared to handle faults or recover from them. If the system is vigilant enough to employ strategies like predictive modelling, it may well avoid potential faults altogether.
#softwarearchitect #availability #architecture #softwaredevelopment #design #reliability