Wednesday, 8 February 2012

Basics of failover mechanism for critical software systems

For many years, systems have been designed to tolerate failures. This means that when a system or sub-system fails, the overall system must still be able to function to the maximum extent possible. The best example of this is the autopilot and manual piloting of an aircraft. Modern aircraft generally fly in autopilot mode over long distances along a pre-designated path using onboard mission computers. When the autopilot malfunctions, it warns the pilots and switches over to manual mode. This allows the aircraft to fly safely and land at the nearest airfield in spite of the failure of a critical system.



Some systems also detect an impending failure and take corrective measures. A common example of failure detection is the POST (Power On Self Test) of a computer. When the POST finds a problem with a component, the firmware may be reconfigured to use an alternate component, if one is available.

A failover mechanism enhances the reliability and, when configured appropriately, also the scalability of a system. Failover systems were first used in fields such as military and civil aviation, biomedical instrumentation and production factories. They are increasingly finding a place in almost all modern, critical software systems in the enterprise software world.

Single point failure

Systems differ in the degree to which they can recover from single or multiple failures of their sub-system(s). This leads us to the term “single point failure” (often called a single point of failure): a single failure, or a single pattern of failure, that makes the whole system fail. The best example is the power cord of a computer breaking or being pulled out; even if there are many power backups, the computer will lose its unsaved data.

Some systems are so critical that if they fail for even one hour, the financial losses may amount to several million dollars. Imagine the payment gateway of VISA or MasterCard failing: unimaginable! You could not even buy medicine in an emergency, and the failure would have a spiralling effect on society.

Degree of recovery

Degree of recovery means a system’s ability to return to normal operation after single or multiple failures. The degree of recovery depends on the specific failure or pattern of failures. It is worked out using a table in which each failure or pattern of failures is charted against the expected recovery. This is sometimes called error channel analysis. The analysis is done when the system is designed.
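
As an illustration only, such a table could be written down as a simple lookup. The sketch below is in Python; the component names ("primary", "backup", "selector") and the recovery levels are assumptions made up for this example and do not come from any particular system.

```python
from enum import Enum


class Recovery(Enum):
    FULL = "full function restored"
    DEGRADED = "reduced function"
    NONE = "system down"


# Hypothetical failure patterns and outcomes, charted at design time.
# Each key is the set of components that have failed.
RECOVERY_TABLE = {
    frozenset(): Recovery.FULL,
    frozenset({"primary"}): Recovery.FULL,        # the backup takes over
    frozenset({"backup"}): Recovery.FULL,         # the primary keeps running
    frozenset({"selector"}): Recovery.DEGRADED,   # assumed outcome
    frozenset({"primary", "backup"}): Recovery.NONE,
}


def expected_recovery(failed):
    """Look up the charted recovery for a pattern of failures.

    Patterns that were never analysed are treated as the worst case.
    """
    return RECOVERY_TABLE.get(frozenset(failed), Recovery.NONE)
```

Treating an uncharted pattern as the worst case is a design choice of this sketch: it keeps a gap in the analysis from being mistaken for a safe outcome.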

The degree of recovery is enhanced either by using one or more dormant (redundant) systems or by letting a parallel system take over automatically. The dormant system is the one that is “sleeping” while an active system is doing the job; it is also called the redundant system. It is made to “wake up” when the active system fails, depending on the nature of the failure. Sometimes the redundancy depends only on context, and the autopilot example from the first paragraph illustrates this best. Manual piloting is not a redundant system but a main system that is put to “sleep” when the autopilot is engaged and brought back when the pilot takes back control, when the autopilot's span of operation is complete, or when the autopilot fails.

Types of failover systems

We can broadly classify the failover system into two different categories.

1) Dormant and Active System
2) Parallel or Load Sharing System

Dormant and Active Systems

Dormant And Active System diagram

When a system or sub-system fails, the overall system switches to an alternate or redundant system or sub-system to run the job. In the diagram above, R1 is a critical system whose functioning must not stop, by requirement. The idea behind this type of design is to add one or more redundant systems so that, if the active system fails, a redundant system takes over the function the failed one was performing. This is like having a backup resource (someone who understands the work and can perform as well as the regular resource) in case a person is out on leave.

Block R1 is the critical system, which is running (active state), and R2 is the redundant system, which is in the “sleep” or dormant state. R1 and R2 are sometimes also called “channels”. The selectors S1 and S2, at the input and output ends of the redundant system, decide which channel to use (R1 or R2). These selectors continuously monitor the health of the active system. They also hold data about the health of the dormant system (whether R2 is working and fit to take over if R1 fails). When the selectors detect that R1 has failed, they switch to R2. The question is what happens if the selectors (S1 and S2) themselves fail, or if both R1 and R2 fail. This is addressed in the design of these systems: the mean time to failure (MTTF) and mean time between failures (MTBF) are determined and thoroughly tested.
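
The switching behaviour described above can be sketched roughly as follows. This is a minimal Python illustration, not a real implementation: the `Channel` and `Selector` classes and the `healthy` flag are assumptions invented for the example, and a single selector stands in for both S1 and S2.

```python
class Channel:
    """A hypothetical processing channel such as R1 or R2."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def process(self, job):
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        return f"{job} done by {self.name}"


class Selector:
    """Routes jobs to the active channel and fails over to the dormant one."""

    def __init__(self, active, dormant):
        self.active = active
        self.dormant = dormant

    def run(self, job):
        if not self.active.healthy:
            if not self.dormant.healthy:
                # Both channels are down: the case the design must rule out.
                raise RuntimeError("no healthy channel available")
            # Wake the dormant channel and make it the active one.
            self.active, self.dormant = self.dormant, self.active
        return self.active.process(job)


r1, r2 = Channel("R1"), Channel("R2")
selector = Selector(active=r1, dormant=r2)
first = selector.run("job-1")    # handled by R1
r1.healthy = False               # simulate a failure of the active channel
second = selector.run("job-2")   # the selector switches to R2
```

Note that the selector here is itself a single point failure, which is exactly the question raised above; the sketch does not try to solve it.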

The advantage of this arrangement is that it ensures the failsafe aspect of the design. It does, however, have an inherent disadvantage: it does not allow productivity to scale up alongside the failover mechanism. While one system is active the other is inactive, which reduces productivity. On the other hand, imagine what would happen to the aircraft if the autopilot and the manual pilot were engaged simultaneously. So this type of failover system is used where multiple systems cannot run in parallel, as with autopilot and manual piloting. The autopilot can, however, use parallel systems internally to improve reliability and increase immunity to failures.

Parallel or Load Sharing Systems

A parallel system is generally used as a workload sharing mechanism and is inherently failsafe. When you go to a supermarket, finish loading your cart and head for the checkout, you see many checkout lanes. Now assume that the machine at one of the checkout terminals fails; you can still check out using another lane. The checkouts per hour may drop, but you can still get your work done. Additionally, picture a person directing customers to a lane that has fewer people in the queue or whose queue is moving fast.

The diagram below shows the general view of the parallel or load sharing system.

Parallel or load sharing system diagram

A typical example of this is a web application hosted on a cluster of multiple web servers. The idea is that when a single web server cannot serve the application to the ever-increasing load from clients, more similar web servers are added. A load balancer component determines the load on each of these nodes and automatically redirects each request to a node that is relatively free. The request is then processed by that node, which responds to the client.
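
A rough sketch of that idea in Python is shown below. It assumes a simple "least loaded node" policy, which is only one of several possible policies; the `Node` and `LoadBalancer` classes and the node names are hypothetical and do not refer to any real load balancer product.

```python
class Node:
    """A hypothetical web server node behind the load balancer."""

    def __init__(self, name):
        self.name = name
        self.load = 0          # requests handed to this node so far
        self.healthy = True

    def handle(self, request):
        # A real server would reduce its load again once the request completes.
        self.load += 1
        return f"{request} served by {self.name}"


class LoadBalancer:
    def __init__(self, nodes):
        self.nodes = nodes

    def dispatch(self, request):
        candidates = [n for n in self.nodes if n.healthy]
        if not candidates:
            raise RuntimeError("no healthy node available")
        # Least-loaded policy; ties go to the first node in the list.
        node = min(candidates, key=lambda n: n.load)
        return node.handle(request)


web1, web2 = Node("web-1"), Node("web-2")
balancer = LoadBalancer([web1, web2])
a = balancer.dispatch("GET /a")   # both idle, so the first node is chosen
b = balancer.dispatch("GET /b")   # web-1 is now busier, so web-2 is chosen
web1.healthy = False              # simulate a node failure
c = balancer.dispatch("GET /c")   # only web-2 is left to serve requests
```

Because unhealthy nodes are skipped, the failure of one node reduces capacity without stopping the service, much like the closed checkout lane in the supermarket.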

The diagram above illustrates the typical configuration of a parallel system. R1 and R2 are the systems that work in parallel, and the input is continuously processed by both of them. Note that the processing of discrete inputs is a little different from that of analog inputs. Units of work are loaded into a queue, and each node picks up a unit of work and starts processing it. A subsystem (the load balancer) manages the distribution of the work, which requires discrete and/or self-contained units of work. Assume that R1 fails while R2 is still running; the work item being processed by R1 is then in an unknown state. The system can respond either by faulting and losing the work, or the input subsystem can react by assigning the item to R2. The selectors S1 and S2 are shown in the diagram as one level deep for better understanding.
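
The second option, where the input subsystem hands the item R1 was working on over to R2, might look roughly like the Python sketch below. The `process_all` function and the way workers are modelled (plain functions that raise an exception when they have failed) are assumptions made for this illustration. Reassigning an item like this is only safe when processing it twice does no harm.

```python
from collections import deque


def process_all(work_items, workers):
    """Distribute work items over workers, reassigning an item on failure.

    `workers` maps a worker name to a function that processes one item
    and raises an exception if that worker has failed.
    """
    queue = deque(work_items)
    alive = list(workers)     # names of workers still considered healthy
    results = {}
    turn = 0
    while queue:
        if not alive:
            raise RuntimeError("all workers failed; remaining work is lost")
        item = queue.popleft()
        name = alive[turn % len(alive)]   # simple round-robin choice
        try:
            results[item] = workers[name](item)
            turn += 1
        except Exception:
            # The worker failed mid-item: drop the worker and put the
            # item back at the front of the queue for another worker.
            alive.remove(name)
            queue.appendleft(item)
    return results


def r1(item):
    raise RuntimeError("R1 is down")      # simulate a failed R1


def r2(item):
    return f"{item} processed by R2"


results = process_all(["a", "b", "c"], {"R1": r1, "R2": r2})
```

Here every item ends up processed by R2, at reduced throughput, instead of the first item being lost along with R1.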

The advantage of parallel systems is increased throughput, since many systems work in parallel. This also helps the scalability aspect of system design. One more advantage is that all the systems are productive, unlike in the dormant and active system.

I will be writing about my implementation of this in my next post, using .NET.

2 comments:

  1. Yes. Perfect. Single point failures are undesirable in any system with a goal of high availability or reliability.

  2. Yes, every system will have weaknesses, known and unknown. The whole exercise of engineering is about minimizing the impact of failures and improving reliability; failures cannot be eliminated.