Understanding Fault-Tolerant Distributed Systems

Posted by MarkNV on 2014-09-05 14:17:00

Summary for F. Cristian, "Understanding Fault-Tolerant Distributed Systems", CACM, vol. 34, no. 2, February 1991, pp. 56-78.

1. Basic Architectural Concepts

1.1 Services, Servers and the "Depends" Relation

A computing service specifies a collection of operations whose execution can be triggered by inputs from service users (clients) or by the passage of time.

The operation defined by a service specification can be performed only by a server for that service. A server implements a service without exposing to users the internal service state representation and operation implementation details.

Servers implement their service by using other services which are implemented by other servers. A server $u$ depends on a server $r$ if the correctness of $u$'s behavior depends on the correctness of $r$'s behavior. Server $u$ is called a user, while $r$ is called a resource of $u$. What is a resource or server at a certain level of abstraction can be a client or a user at another level of abstraction.

1.2 Failure Classification

A server designed to provide a certain service is correct if, in response to inputs, it behaves in a manner consistent with the service specification.

A server failure occurs when the server does not behave in the manner specified.

An omission failure occurs when a server omits to respond to an input.
A timing failure occurs when the server's response is functionally correct but untimely. Timing failures can be either early timing failures or late timing failures (performance failures).
A response failure occurs when the server responds incorrectly: either the value of its output is incorrect (value failure) or the state transition that takes place is incorrect (state transition failure).
A crash failure occurs when, after a first omission to produce output, a server omits to produce output to subsequent inputs until its restart.
- An amnesia-crash occurs when the server restarts in a predefined initial state that does not depend on the inputs seen before the crash.
- A partial-amnesia-crash occurs when, at restart, some part of the state is the same as before the crash while the rest of the state is reset to a predefined initial state.
- A pause-crash occurs when a server restarts in the state it had before the crash.
- A halting-crash occurs when a crashed server never restarts.
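
The classification above can be sketched as a small classifier for a single observed response. This is only an illustration: `Spec`, `Response`, and `classify` are hypothetical names, and the sketch covers only failures visible from one response (state transition failures and crashes need a longer history to observe).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Failure(Enum):
    NONE = "correct"
    OMISSION = "omission failure"
    EARLY_TIMING = "early timing failure"
    LATE_TIMING = "late timing (performance) failure"
    VALUE = "response (value) failure"


@dataclass
class Spec:
    expected_value: int
    earliest: float  # earliest acceptable response time, in seconds
    latest: float    # latest acceptable response time, in seconds


@dataclass
class Response:
    value: int
    elapsed: float   # time between input and response, in seconds


def classify(spec: Spec, response: Optional[Response]) -> Failure:
    """Classify one observed response against the (hypothetical) spec."""
    if response is None:
        return Failure.OMISSION        # the server never answered
    if response.value != spec.expected_value:
        return Failure.VALUE           # functionally incorrect output
    # From here on the response is functionally correct; only timing can be wrong.
    if response.elapsed < spec.earliest:
        return Failure.EARLY_TIMING
    if response.elapsed > spec.latest:
        return Failure.LATE_TIMING
    return Failure.NONE
```

The value check comes before the timing checks because a timing failure is defined only for responses that are functionally correct.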

1.3 Failure Semantics

When programming recovery actions for a server failure, it is important to know what failure behaviors the server is likely to exhibit. Since the recovery actions invoked upon detection of a server failure depend on the likely failure behaviors of the server, in a fault-tolerant system one has to extend the standard specification of servers to include, in addition to their familiar failure-free semantics (the set of failure-free behaviors), their likely failure behaviors, or failure semantics.

If the specification of a server s prescribes that the failure behaviors likely to be observed by s's users should be in class F, it is said that "s has F failure semantics".

In general, the stronger the desired failure semantics, the more expensive it is to implement. Arbitrary failure semantics is the weakest, since it places no restriction on how a server may fail.
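
One way to make "stronger" and "weaker" concrete is to model each failure semantics as the set of failure behaviors it allows, so that a stronger semantics is a proper subset of a weaker one. The sketch below assumes a simple nesting of crash, omission, performance, and arbitrary semantics; the set contents and the name `is_stronger` are illustrative assumptions, not definitions taken from the paper.

```python
# Each failure semantics is modelled as the set of failure behaviors it allows.
# The nesting below is an assumption made for illustration.
CRASH = frozenset({"crash"})
OMISSION = CRASH | {"omission"}
PERFORMANCE = OMISSION | {"late timing"}
ARBITRARY = PERFORMANCE | {"early timing", "value", "state transition"}


def is_stronger(f: frozenset, g: frozenset) -> bool:
    """F is stronger than G when F allows strictly fewer failure behaviors."""
    return f < g  # proper-subset test


print(is_stronger(CRASH, OMISSION))   # crash semantics is more restrictive
print(is_stronger(ARBITRARY, CRASH))  # arbitrary semantics restricts nothing
```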

1.4 Hierarchical Failure Masking

In software, hierarchical failure masking closely resembles exception handling. In a hierarchical system, exception handling provides a convenient way to propagate information about failure detections across abstraction levels and to mask low-level failures from high-level servers.

If the server u at level j can provide its service despite the failure of r at level i, we say that u masks r's failure.
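
A minimal sketch of hierarchical masking with exceptions, assuming a hypothetical level-j server `lookup` that depends on two hypothetical level-i resources. The primary resource is hard-coded to fail so that the masking path is exercised.

```python
class ResourceFailure(Exception):
    """Raised when a lower-level resource r is detected to have failed."""


def read_from_primary(key: str) -> str:
    # Hypothetical level-i resource that always fails in this sketch.
    raise ResourceFailure("primary resource did not respond")


def read_from_backup(key: str) -> str:
    # Hypothetical alternative level-i resource.
    return f"value-of-{key}"


def lookup(key: str) -> str:
    """Level-j server u: masks the failure of r if an alternative works."""
    try:
        return read_from_primary(key)
    except ResourceFailure:
        # Masking attempt: u still provides its service using another resource.
        # If this call also raises, the exception propagates to u's own users,
        # meaning the failure is not masked at this level.
        return read_from_backup(key)


print(lookup("x"))
```

Callers of `lookup` never see the primary's failure, which is what masking means here; they would see a `ResourceFailure` only if the backup failed as well.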

1.5 Group Failure Masking

To ensure that a service remains available to clients despite server failures, one can implement the service by a group of redundant, physically independent servers, so that if some of these fail, the remaining ones provide the service.

While hierarchical masking requires users to implement any resource failure-masking attempts as exception handling code, with group masking, individual member failures are entirely hidden from users by the group management mechanisms.

Note that redundancy management is cheap for servers with strong failure semantics but expensive for servers with weak failure semantics.

This implies that we need to balance the amount of failure detection, recovery, and masking redundancy mechanisms at various levels of abstraction to obtain the best overall cost/performance/dependability results.
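
The cost difference can be illustrated with two hypothetical group-management routines. In this sketch, members with crash semantics either answer correctly or return `None`, so the first answer can be trusted and k+1 members are enough to mask k crashes. Members that may return wrong values need a comparison step, and the strict-majority rule used here needs 2k+1 members to mask k wrong answers. All names are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# A member returns None when it has crashed (an assumption of this sketch).
Member = Callable[[int], Optional[int]]


def group_call_crash(members: Sequence[Member], x: int) -> Optional[int]:
    """Members with crash semantics: any answer that arrives is correct."""
    for member in members:
        result = member(x)
        if result is not None:
            return result
    return None  # every member crashed, so the group itself fails


def group_call_voting(members: Sequence[Member], x: int) -> Optional[int]:
    """Members that may return wrong values: accept only a strict majority."""
    answers = [r for r in (m(x) for m in members) if r is not None]
    if not answers:
        return None
    value, votes = Counter(answers).most_common(1)[0]
    return value if votes > len(members) // 2 else None
```

With crash semantics the group needs no comparison logic at all, while voting requires collecting and comparing every answer, which is one concrete sense in which weak failure semantics makes redundancy management more expensive.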

2. Hardware Architectural Issues

2.1 Replaceable Hardware Unit

A replaceable hardware unit is a physical unit that fails independently of other units, can be removed from a cabinet without affecting other units (without disruption to higher-level software servers), and can be added to a system to augment its performance, capacity, or availability.

Coarse granularity architecture: A replaceable unit includes several elementary servers, e.g., CPU, memory, I/O controller.
Fine granularity architecture: Elementary hardware servers are replaceable units.

2.2 Hardware Failure Semantics

What failure semantics does operating system software usually assume for hardware replaceable units?

CPU - crash
Bus - omission
Memory - read omission
Disk - read/write omission
I/O controller - crash
Network - omission or performance failure

2.3 Hardware Failure Semantics Enforcement

How are the specified hardware failure semantics implemented?

The best understood technique is the use of error-detecting codes. It remains the method of choice for detecting failures in storage and communication hardware servers such as memories, disks, buses, and communication lines.
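
As a software analogy, the sketch below uses a CRC-32 checksum (via Python's `zlib.crc32`) to turn a corrupted read into a read omission. The `store` and `load` helpers are hypothetical, and real memories and disks use hardware codes such as parity or ECC rather than this exact scheme.

```python
import zlib
from typing import Optional


def store(data: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the data before writing it."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def load(stored: bytes) -> Optional[bytes]:
    """Return the data, or None (a read omission) if the check fails."""
    data, checksum = stored[:-4], stored[-4:]
    if zlib.crc32(data).to_bytes(4, "big") != checksum:
        return None  # corruption detected: report nothing rather than bad data
    return data


word = store(b"hello")
corrupted = bytes([word[0] ^ 0x01]) + word[1:]  # flip one bit
print(load(word))       # the original data
print(load(corrupted))  # None: the error is detected, not returned as data
```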

However, duplication and matching (lock-step duplication) is a better choice in complex circuits, such as CPUs and device and communication controllers based on off-the-shelf microprocessors.

There are several advantages of lock-step duplication:

Lock-step duplication provides a better guarantee of crash failure semantics for complex servers. For CPUs and I/O controllers based on error-detecting codes, there is a possibility that the data written to a bus or storage during the last "few" cycles before a failure detection is erroneous. Duplication and matching with self-checking comparator circuits virtually eliminates the possibility of such damage. The cost of duplication and matching is that two physical hardware servers plus the comparison logic are needed instead of only one elementary server augmented with error-detecting circuitry.
The absence of error-detecting circuitry in elementary physical servers reduces their complexity, leading to increased reliability and reduced design and testing cost. It also makes the server faster.
Another reason for using lock-step duplication is the availability of cheap, fast microprocessors that do not have much error-detection circuitry.
One last advantage of lock-step duplication is improved software quality growth. When any elementary hardware server failure is promptly detected as a disagreement before any data damage occurs, it is easier to determine whether failures were caused by software design faults or by physical faults.
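
The idea of duplication and matching can be sketched in software as follows. `LockStepPair` is a hypothetical class that runs two units on the same input and halts at the first disagreement, so that no suspect output ever leaves the pair. Real lock-step hardware compares outputs cycle by cycle, which this sketch does not attempt to model.

```python
from typing import Callable


class Crashed(Exception):
    """The duplicated pair has stopped instead of emitting a suspect output."""


class LockStepPair:
    def __init__(self, unit_a: Callable[[int], int], unit_b: Callable[[int], int]):
        self.unit_a = unit_a
        self.unit_b = unit_b
        self.halted = False

    def execute(self, x: int) -> int:
        if self.halted:
            # Crash semantics: once halted, no further outputs are produced.
            raise Crashed("pair is halted")
        a, b = self.unit_a(x), self.unit_b(x)
        if a != b:
            # The comparator sees a disagreement: halt before any output.
            self.halted = True
            raise Crashed("outputs disagree")
        return a
```

The comparator cannot tell which unit is wrong, only that they differ, which is why the pair as a whole stops and masking is left to a higher level.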

2.4 Hardware Failure Masking

At what level of abstraction are the failures of hardware replaceable units masked?

Masking at hardware level (e.g., Stratus)
- Redundancy at the hardware level.
- Duplexing CPU-servers with crash failure semantics provides single-fault tolerance.
- Increases mean time between failure for CPU service.
Masking at operating system level (e.g., Tandem process groups)
- Redundancy at the O.S. level.
- Hierarchical masking hides single CPU failure from higher level software servers by restarting a process that ran on a failed CPU in a manner transparent to the server.
Masking at application server level (e.g., IBM XRF, AAS)
- Redundancy at the application level.
- Group masking hides CPU failure from users by using a group of redundant software servers running on distinct hardware hosts and maintaining global service state.

3. Software Architectural Issues

3.1 Software Servers

Software servers are analogous to hardware replaceable units. These are the basic units of failure, replacement, and growth of software. As with hardware replaceable units, the ultimate goal is to enable software servers to be removed from a system without disrupting the activity of the users.

If masking is impossible or not economical, ensure "nice" failure semantics, which allow higher-level users (possibly human) to use simple masking techniques such as "log in and try again".

3.2 Software Failure Semantics

If service state is persistent (e.g. ATM), servers are typically required to implement omission (atomic transaction, at-most-once) failure semantics.

If service state is not persistent (e.g., network topology management, virtual circuit management, low-level I/O controller), then crash failure semantics is sufficient.
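
For the persistent case, the following is a minimal in-memory sketch of at-most-once, all-or-nothing behavior for an ATM-like withdrawal. The `Account` class and its request-id log are assumptions made for illustration; a real server would keep both the balance and the log on stable storage so that they survive a crash.

```python
from typing import Dict


class InsufficientFunds(Exception):
    pass


class Account:
    def __init__(self, balance: int):
        self.balance = balance
        # Maps a request id to the balance that resulted from that request.
        self.completed: Dict[str, int] = {}

    def withdraw(self, request_id: str, amount: int) -> int:
        # At-most-once: a retried request is answered from the log, not redone.
        if request_id in self.completed:
            return self.completed[request_id]
        # All-or-nothing: validate before touching any state.
        if amount > self.balance:
            raise InsufficientFunds(request_id)
        self.balance -= amount
        self.completed[request_id] = self.balance
        return self.balance
```

A client that receives no reply can safely resend the same `request_id`: the withdrawal either happened exactly once or not at all.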

3.3 Software Failure Semantics Enforcement

To implement atomic transaction or crash failure semantics, the operations implemented by servers are assumed to be at least partially correct.

A program is totally correct if it behaves as specified in response to any input as long as the services it uses do not fail.

A partially correct program may suffer a crash or a performance failure for certain inputs even when the services it uses do not fail.

3.4 Software Failure Masking

Functional redundancy (e.g., N-version programming, recovery blocks)
Software server groups. The use of software server groups raises a number of issues that are not well understood.
- How do clients address service requests to server groups?
- What group-to-group communication protocols are needed?

Acknowledgement:

This article contains my study notes for the course CS386C Dependable Computing Systems (Fall 2014), taught by Prof. A.K. Mok.