Summary of "Implementing Fault-Tolerant Services" || hariswb

Implement Fault-tolerant Services Using State Machine Approach is a tutorial by Fred B. Schneider. I’d like to summarize what state machine approach is and how it works for distributed system.

Distributed system, compared to a single server service, promises high service availability to client requests. This is achieved by distributing replicas of server to isolated processors. The replicas can fail, but they fail independently from one another. The whole system failure is seen from the perspective of components failure. When a component is faulty, it’s behavior no longer consistent with its specification. Component failure is a certainty, but to what extent is it tolerable and how to manage it?

State machine is defined as having state variables and a set of command to interact with them. It works deterministically. Given a sequence of requests into the machine, the result is always the exact same output. This deterministic property is the base for evaluating whether a distributed system is fault tolerant or not.

State machine approach in a distributed system is as an ensemble of state machine replicas. They are assumed to have agreement and order rule. It basically says that all replica receives all requests and process them in the same order. When a replica behaves outside of this assumption, it’s said to be faulty.

The author introduces t fault-tolerant concept, t being the number of allowed faulty replicas in the system for it to work as specified. We can obtain this number exactly by knowing the type of failures in our system, is it byzantine failure and fail-stop failure? With this number, we can know how many replicas we need for the system to be fault tolerant. For byzantine failure, replicas should be as many as 2t+1 because this way the majority of outputs should remain correct. For fail stop failures, it’s t+1 because one non faulty replica will remain among those.

The agreement and order assumption, which has the notion of time, sets the ground for a test for failure detection. It’s implemented by mapping the processed events to time which is either logical or real-time clock. Consider client event r which comes before r'. All replicas receive them and they assign time to the event T(r) < T(r'), T being a mapping of event r to time. The failure detection follows this assumption, implying that non-faulty replica always ignores event with lower timestamp. A state machine won’t accept event r after it has processesed event r'.

Generally, state machine approach operates with the agreement and order assumption. The state machine replicas receive all events by clients and process them in the correct chronological order. We can have failure detection and ensure non-faulty replica always work with that assumption. This summary only covers this surface. Deeper into the tutorial, the author walks us through the implementation details on various scenarios involving the type of failures, clock types, and whether the faulty output matters in outside system or within the state machine replicas. This tutorial is a good introduction for designing fault tolerant service.