Summary for F. Cristian, "Understanding Fault-Tolerant Distributed Systems", CACM, vol. 34, no. 2, February 1991, pp. 56-78.
A computing service specifies a collection of operations whose execution can be triggered by inputs from service users (clients) or by the passage of time.
The operations defined by a service specification can be performed only by a server for that service. A server implements a service without exposing to users the internal representation of the service state or the implementation details of its operations.
Servers implement their service by using other services which are implemented by other servers. A server $u$ depends on a server $r$ if the correctness of $u$'s behavior depends on the correctness of $r$'s behavior. Server $u$ is called a user, while $r$ is called a resource of $u$. What is a resource or server at a certain level of abstraction can be a client or a user at another level of abstraction.
A server designed to provide a certain service is correct if, in response to inputs, it behaves in a manner consistent with the service specification.
A server failure occurs when the server does not behave in the manner specified.
When programming recovery actions for a server failure, it is important to know what failure behaviors the server is likely to exhibit. Since the recovery actions invoked upon detection of a server failure depend on the likely failure behaviors of the server, in a fault-tolerant system one has to extend the standard specification of servers to include, in addition to their familiar failure-free semantics (the set of failure-free behaviors), their likely failure behaviors, or failure semantics.
If the specification of a server s prescribes that the failure behaviors likely to be observed by the users of s should be in class F, it is said that "s has F failure semantics".
In general, the stronger the desired failure semantics, the more expensive it is to implement. Arbitrary failure semantics is the weakest, since it places no restriction on how a server may fail.
In software, hierarchical failure masking resembles exception handling. In hierarchical systems built from servers with specified failure semantics, exception handling provides a convenient way to propagate information about detected failures across abstraction levels and to mask low-level failures from high-level servers.
If the server u at level j can provide its service despite the failure of a resource r at level i, we say that u masks r's failure.
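Hierarchical masking can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical resource r that signals its failures by raising an exception; the names `ResourceFailure`, `ServiceFailure`, `make_flaky_read`, and `user_read` are invented for this example and do not come from the paper.

```python
class ResourceFailure(Exception):
    """Raised by the lower-level resource r when it fails (assumed signal)."""


class ServiceFailure(Exception):
    """Raised by the user u when it cannot mask r's failure."""


def make_flaky_read(failures_before_success):
    """Build a hypothetical resource that fails a fixed number of times."""
    remaining = [failures_before_success]

    def read():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ResourceFailure("omission: no reply from r")
        return "data"

    return read


def user_read(resource_read, max_attempts=3):
    """Server u: mask up to max_attempts - 1 failures of r by retrying."""
    for _ in range(max_attempts):
        try:
            return resource_read()
        except ResourceFailure:
            continue  # masking attempt: try the resource again
    # Masking did not succeed, so u reports its own failure upward.
    raise ServiceFailure("u could not mask r's failure")
```

Here `user_read(make_flaky_read(2))` returns normally because two failures of r are masked, while a resource that keeps failing causes u itself to fail with `ServiceFailure`, which the next level up may try to mask in turn.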
To ensure that a service remains available to clients despite server failures, one can implement the service by a group of redundant, physically independent servers, so that if some of these fail, the remaining ones provide the service.
While hierarchical masking requires users to implement any resource failure-masking attempts as exception handling code, with group masking, individual member failures are entirely hidden from users by the group management mechanisms.
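The contrast with hierarchical masking can be seen in a small sketch of group masking. It assumes that members have crash-like failure semantics, modeled here by a hypothetical `MemberCrashed` exception; `ServerGroup`, `crashed`, and `healthy` are illustrative names only.

```python
class MemberCrashed(Exception):
    """Signals that one group member did not provide the service."""


class ServerGroup:
    """Hypothetical group of redundant servers offering one service."""

    def __init__(self, members):
        self.members = list(members)

    def request(self, x):
        # Try members in order; callers never see individual member failures.
        for member in self.members:
            try:
                return member(x)
            except MemberCrashed:
                continue
        # Only when every member has failed does the group itself fail.
        raise RuntimeError("group failure: all members failed")


def crashed(_x):
    raise MemberCrashed()


def healthy(x):
    return x * 2


group = ServerGroup([crashed, healthy])
result = group.request(21)  # the crashed member is hidden from the caller
```

The caller contains no exception handling for member failures; that logic lives in the group. This simple failover only works because a failed member is assumed to stop rather than return wrong answers. Masking arbitrary failures would require comparing the outputs of several members, which is more costly.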
Note that redundancy management is cheap for servers with strong failure semantics, while it is expensive for servers with weak failure semantics.
This implies that we need to balance the amount of failure detection, recovery, and masking redundancy at the various levels of abstraction to obtain the best overall cost/performance/dependability results.
A replaceable hardware unit is a physical unit that fails independently of other units, can be removed from a cabinet without affecting other units (and without disrupting higher-level software servers), and can be added to a system to augment its performance, capacity, or availability.
What failure semantics is specified for hardware replaceable units and usually assumed by operating system software?
How is the specified hardware failure semantics implemented?
The best-understood technique is the use of error-detecting codes. It remains the method of choice for detecting failures in storage and communication hardware servers such as memories, disks, buses, and communication lines.
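As a small illustration of failure detection with an error-detecting code, the sketch below appends a CRC-32 checksum to a stored block and verifies it on read. CRC-32 (via Python's standard `zlib.crc32`) is just one example of such a code, chosen here for convenience; `store`, `load`, and `CorruptionDetected` are hypothetical names.

```python
import zlib


class CorruptionDetected(Exception):
    """Raised when the stored checksum does not match the payload."""


def store(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def load(block: bytes) -> bytes:
    """Return the payload, or raise if the checksum does not match."""
    payload, stored = block[:-4], int.from_bytes(block[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise CorruptionDetected("CRC mismatch")
    return payload


block = store(b"hello")
corrupted = bytes([block[0] ^ 0x01]) + block[1:]  # flip one bit of the payload
```

Calling `load(block)` returns the original payload, while `load(corrupted)` raises instead of silently returning wrong data. In this way a unit that might return corrupted data is turned into one that visibly fails, which is easier for higher levels to handle.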
However, duplication and matching (lock-step duplication) is a better choice in complex circuits, such as CPUs and device and communication controllers based on off-the-shelf microprocessors.
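The idea of duplication and matching can be shown with a software analogy. Real lock-step duplication compares the outputs of two hardware units continuously; the sketch below only compares the results of two hypothetical units per call, and it assumes the two units fail independently. `duplicated`, `MismatchDetected`, `add_one`, and `faulty_add_one` are illustrative names.

```python
class MismatchDetected(Exception):
    """Raised when the two duplicated units disagree."""


def duplicated(unit_a, unit_b):
    """Wrap two units so that any disagreement stops the computation."""

    def run(x):
        a = unit_a(x)
        b = unit_b(x)
        if a != b:
            # A disagreement means at least one unit has failed, so stop
            # rather than emit a possibly wrong result.
            raise MismatchDetected(f"outputs differ: {a!r} != {b!r}")
        return a

    return run


def add_one(x):
    return x + 1


def faulty_add_one(x):
    return x + 2  # simulated faulty unit


good_pair = duplicated(add_one, add_one)
bad_pair = duplicated(add_one, faulty_add_one)
```

`good_pair(1)` returns 2, while `bad_pair(1)` raises `MismatchDetected`. The pair never delivers a result on which its two units disagree, so a wrong output from a single unit is converted into a detectable stop.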
There are several advantages of lock-step duplication:
At what level of abstraction are the failures of hardware replaceable units masked?
Software servers are analogous to hardware replaceable units. These are the basic units of failure, replacement, and growth of software. As with hardware replaceable units, the ultimate goal is to enable software servers to be removed from a system without disrupting the activity of the users.
If masking is impossible or not economical, ensure "nice" failure semantics, which allow higher-level users (possibly human) to use simple masking techniques such as "log in and try again".
If service state is persistent (e.g., ATM), servers are typically required to implement omission (atomic transaction, at-most-once) failure semantics.
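One ingredient of at-most-once semantics, suppressing duplicate executions of a retried request, can be sketched as follows. This is an in-memory illustration only: a real server with persistent state would also have to record the reply and the state change atomically on stable storage. The `Account` class and its request identifiers are hypothetical.

```python
class Account:
    """Hypothetical server whose state must survive client retries."""

    def __init__(self, balance):
        self.balance = balance
        self._replies = {}  # request_id -> reply already produced

    def withdraw(self, request_id, amount):
        # A retried request is answered from the log, not executed again.
        if request_id in self._replies:
            return self._replies[request_id]
        if amount > self.balance:
            reply = "rejected"
        else:
            self.balance -= amount
            reply = "ok"
        self._replies[request_id] = reply
        return reply


account = Account(100)
first = account.withdraw("req-1", 30)
retry = account.withdraw("req-1", 30)  # same request sent again by the client
```

After the retry, the balance has been reduced once, not twice, so a client that never saw the first reply can safely resend its request.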
If service state is not persistent (e.g., network topology management, virtual circuit management, low-level I/O controllers), then crash failure semantics is sufficient.
To implement atomic transaction or crash failure semantics, the operations implemented by servers are assumed to be at least partially correct.
A program is totally correct if it behaves as specified in response to any input as long as the services it uses do not fail.
A partially correct program may suffer a crash or a performance failure for certain inputs even when the services it uses do not fail.
Acknowledgement:
These are my learning notes for the course CS386C Dependable Computing Systems (Fall 2014), taught by Prof. A.K. Mok.