The design of distributed systems must meet real-time, fault-tolerance, and security requirements. This article contains notes on 7 fundamental topics in building dependable distributed computing systems.
Notes for Cristian, F. (1989), "Probabilistic clock synchronization", Distributed Computing (Springer) 3 (3): 146–158
$H$: hardware clock, $\rho$: maximum clock drift rate
$P$: slave's clock, send message ("time=?") to $Q$
$Q$: master's clock, send message ("time=",$T$) back to $P$
$T$: time on $Q$'s logical clock
$D$: half round-trip delay measured on $P$'s clock ($2d$ is the real round-trip delay)
$min$: minimum one-way message transmission delay
$[T+min(1-\rho),T+2D(1+2\rho)-min(1+\rho)]$: interval containing the value on $Q$'s clock when $P$ receives the ("time=",$T$) reply
Proof: Let $t$ be the real-time when $P$ receives the ("time=",$T$) message from $Q$ and $C_{Q}(t)$ be the value displayed by $Q$'s clock at that time. Let $min+\alpha,min+\beta,\alpha\geq0,\beta\geq0$, be the real time delays experienced by the ("time=?") and ("time=",$T$) messages.
(1) $2d=2min+\alpha+\beta$
(2) $0\leq\beta\leq2d-2min$, implied by (1) since $\alpha$ and $\beta$ are both positive
(3) $C_{Q}(t)\in[T+(min+\beta)(1-\rho),T+(min+\beta)(1+\rho)]$, by the definition of $\beta$, and the fact that $Q$'s clock can run at any speed in the interval $[1-\rho,1+\rho]$
(4) $C_{Q}(t)\in[T+min(1-\rho),T+(2d-min)(1+\rho)]$, by combining (2) and (3)
(5) $D(1-\rho)\leq d\leq D(1+\rho)$
(6) $C_{Q}(t)\in[T+min(1-\rho),T+2D(1+2\rho)-min(1+\rho)]$, by substituting (5) into (4)
$C_{Q}^{P}(T,D)=T+D(1+2\rho)-min\rho$: $P$'s best estimate of $Q$'s local time by midpoint of interval (6)
$e=D(1+2\rho)-min$: precision, maximum error $P$ can make
$U_{min}=min(1+\rho)$: minimum timeout delay, best reading precision $e_{min}=3\rho min$
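The interval (6), its midpoint estimate, and the reading error $e$ can be checked with a minimal sketch (my own, with illustrative numbers; `min_delay` stands for $min$):

```python
# Sketch of P estimating Q's time from one round trip, per interval (6).
def estimate_master_time(T, D, rho, min_delay):
    """Return P's best estimate of Q's clock and the maximum reading error e.

    T         -- timestamp in Q's ("time=", T) reply
    D         -- half round-trip delay measured on P's clock
    rho       -- maximum clock drift rate
    min_delay -- minimum one-way message transmission delay (min)
    """
    # Interval (6): C_Q(t) in [T + min(1-rho), T + 2D(1+2rho) - min(1+rho)]
    low = T + min_delay * (1 - rho)
    high = T + 2 * D * (1 + 2 * rho) - min_delay * (1 + rho)
    estimate = (low + high) / 2   # midpoint = T + D(1+2rho) - min*rho
    error = (high - low) / 2      # half-width = D(1+2rho) - min
    return estimate, error

# Illustrative values: T = 100 s, D = 10 ms, rho = 1e-5, min = 4 ms
est, e = estimate_master_time(100.0, 0.010, 1e-5, 0.004)
```

The midpoint and half-width reproduce the closed forms $C_{Q}^{P}(T,D)=T+D(1+2\rho)-min\rho$ and $e=D(1+2\rho)-min$ given below.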
Amortization: $C(t)=H(t)+A(t)$: logical clock
$A(t)=m*H(t)+N$: periodically computed adjustment function, where the parameters $m$ and $N$ are computed periodically as described below. If, at local rapport time $L$, a slave estimates that the master clock displays time $M$, $M\neq L$, the goal is to increase (if $M > L$) or decrease (if $M < L$) the speed of the slave clock $C$ so that it shows time $M+\alpha$ (instead of $L+\alpha$) $\alpha$ clock time units after rapport, where $\alpha$ is a positive clock amortization parameter. At the beginning and end of the amortization period the slave clock must display the values $L=H(1+m)+N$ and $M+\alpha=(H+\alpha)(1+m)+N$, where $H$ is the hardware clock value at rapport. Solving this system of equations, the parameters $m,N$ must be set to $m=(M-L)/\alpha$, $N=L-(1+m)*H$ for the $\alpha$ clock time units following rapport.
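The two boundary conditions at rapport can be solved numerically; a small sketch with illustrative values (my own, not the paper's code):

```python
# Compute the amortization parameters m and N at rapport time.
def amortization_params(L, M, H, alpha):
    """Given slave clock value L, estimated master value M, and hardware
    clock value H at rapport, plus amortization period alpha (clock time
    units), return (m, N) so that C = H*(1+m) + N reaches M+alpha after
    alpha hardware clock units."""
    m = (M - L) / alpha
    N = L - (1 + m) * H
    return m, N

# Example: slave reads L=50.0, master estimated at M=50.2, H=47.0 at
# rapport; amortize over alpha=10 clock time units.
m, N = amortization_params(50.0, 50.2, 47.0, 10.0)
# At rapport: H*(1+m)+N == L; alpha units later: (H+alpha)*(1+m)+N == M+alpha
```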
$em$ (external-master): maximum deviation of master clock from the external time
$ms$ (master-slave): maximum deviation of slave from the master clock
$es=em+ms$ (external-slave): maximum deviation of a slave from external time
$ss=2ms$ (slave-slave): maximum deviation of two slaves
$d(t)$: distance between the slave and the master clock at real time $t$.
$d(t)=ms-\frac{ms-e}{\alpha(1-\rho)}t+\rho t$: Suppose the distance between the master and slave clocks is $ms$ at synchronization. Then during amortization $d(t)=ms-\frac{|M-L|}{\alpha'}t+\rho t$, where $|M-L|\leq ms-e$ in order to meet the precision specification and $\alpha'=\alpha(1-\rho)$.
$C_{M}(t)\in{[T+min,T+2D-min]}$: When a slave $S$ receives a successful round trip of length $2D$ from the master $M$, the master clock $C_{M}$ is in this interval.
$C_{M}^{S}(T,D)=T+D$: Slave's best estimate of Master's local time
$e=D-min$: maximum reading error that $S$ can make
$e+\rho*t \leq ms$: We want this to hold until the next synchronization, where $e$ is the reading error at the end of amortization.
$dnr=\rho^{-1}(ms-e)$: maximum delay to the next synchronization point
$dna=\rho^{-1}(ms-e)-(1+\rho)kW$: $dnr$ minus the synchronization delay, where synchronization takes $k$ attempts of $W$ time units each.
$DNA=(1-\rho)dna$: allowable delay measured by slave
$DNA_{max}=\rho^{-1}(1-\rho)ms-kW$
$DNA_{min}=\rho^{-1}(1-\rho)(ms-e)-kW$ ($e=U-min$)
$ms_{min}=U-min+\rho(1+\rho)kW$
$\rho^{2}$ can be ignored
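Plugging illustrative numbers (my own, not from the paper) into the $DNA_{max}$ and $ms_{min}$ formulas shows the scale of the resynchronization period:

```python
# Back-of-the-envelope evaluation of DNA_max and ms_min; all values are
# illustrative, not taken from the paper.
rho = 1e-5          # maximum clock drift rate
ms = 1e-2           # target master-slave deviation: 10 ms
k, W = 3, 0.05      # k read attempts, W seconds each
U, min_delay = 0.008, 0.004   # timeout delay and minimum delay

# DNA_max = rho^{-1}(1-rho)*ms - kW: with these numbers, a 10 ms deviation
# bound allows roughly 1000 s between synchronizations.
DNA_max = (1 / rho) * (1 - rho) * ms - k * W

# ms_min = U - min + rho(1+rho)*kW: the tightest achievable deviation,
# dominated here by the reading error U - min = 4 ms.
ms_min = U - min_delay + rho * (1 + rho) * k * W
```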
Notes for Cristian, F. (1990), "Synchronous atomic broadcast for redundant broadcast channels", Real-Time Systems 2(3): 195-212, September 1990. Also IBM Research Report RJ7203, December 1989 (revised April 1990).
In the absence of failures, any update $\sigma$ accepted for broadcast by a processor $s$ at time $T$ on its clock is received and processed by any processor $q$ by time $T+\delta+\epsilon$ on $q$'s clock, where $\delta=P+O+C+I+P$ denotes the processor-to-processor message delay bound. The term $\epsilon$ has to be added because $q$'s clock can be as much as $\epsilon$ time units ahead of $s$'s clock.
Prompt forwarding rule: A processor forwards any new message it receives from a channel on all other channels as soon as it receives the message. $(f+1)+(n-1)f=nf+1$ messages are sent per broadcast.
Simple lazy forwarding rule: A sending processor $s$ enqueues any message $m$ to be broadcast on the out-adapters to channels $1,2,...,f+1$ in this (increasing) order. Let $p$ be an arbitrary processor different from $s$ that receives $m$ and let $c$ be the highest channel number on which $p$ receives a copy of $m$. If $c\geq f$ then $p$ does not forward the message; else, $p$ forwards $m$ on channels $c+1,...,f$. In the absence of failures, only $f+1$ (minimum) messages are sent per broadcast.
Unanimity property: If a processor $s$ initiates the broadcast of a message $m$ and some correct processor $p$ receives $m$, then any correct processor $q$ receives $m$.
$\Delta=\delta+\epsilon$: protocol termination time.
$T+\Delta$: delivery time for updates with timestamp $T$.
$(T,s,\sigma)$: message consists of timestamp, starter id and update content
If $U\geq T+\Delta$ then "late message". If $T\in dom(H)$ and $s\in dom(H(T))$ then "deja vu message". Both types of messages are discarded.
$h$: hop count, is incremented by 1 each time a processor forwards $m$.
$U < T+h(\delta+\epsilon)$: timely.
$(T,s,\sigma,h)$: message consists of timestamp, starter id, update content and hop count
$\Delta=\lceil f/2\rceil(\delta+\epsilon)+(\delta+\epsilon)$, $\Delta\leq w+\delta+\epsilon$, where $w=k(\delta+\epsilon)$ if $f=2k$ and $w=(k+1)(\delta+\epsilon)$ if $f=2k+1$
If $U\geq T+\Delta$ then "late message". If $T\in dom(H)$ and $s\in dom(H(T))$ then "deja vu message". If $U\geq T+h(\delta+\epsilon)$ then "too late to forward". All three types of messages are discarded.
Lazy forwarding rule: To initiate a broadcast, a sender $s$ enqueues message $(T,s,\sigma,1)$ on its out-adapters $1,2,...,f+1$ in this order. Let $(T,s,\sigma,h)$, $h\leq k$ be a message accepted by a processor $p\neq s$ and let $c$ be the highest channel on which $p$ receives a copy of the message by local time $T+h(\delta+\epsilon)$. If at $T+h(\delta+\epsilon)$ on $p$'s clock, $c < f+1-h$, then $p$ forwards $(T,s,\sigma,h+1)$ on channels $c+1,...,f+1-h$, else, $p$ does not forward. The worst case message cost in the presence of failures is $(f+1)+(f-1)(n-1)=(f-1)n+2$.
LEMMA. Let $(T,s,\sigma,h)$, $h\leq k$, be a message accepted by processor $p$ and let $c$ be the highest channel on which $p$ receives a copy of the message by local time $T+h(\delta+\epsilon)$. If at time $T+h(\delta+\epsilon)$ on $p$'s clock $c < f+1-h$, then at least $h$ component failures have occurred since the broadcast was initiated; else, at least $h-1$ failures have occurred.
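The forwarding decision in the lazy rule above can be sketched as a small function (my own simplification; channel numbers and parameters are as defined in the rule):

```python
# Lazy forwarding decision with hop counts: given the highest channel c on
# which p has received a copy of (T, s, sigma, h) by local time
# T + h*(delta+eps), decide which channels to forward (T, s, sigma, h+1) on.
def channels_to_forward(c, h, f):
    """c -- highest channel on which a copy arrived in time
    h -- hop count of the accepted message
    f -- maximum number of component failures to tolerate
    Returns the (possibly empty) list of channels p forwards on."""
    if c < f + 1 - h:
        return list(range(c + 1, f + 1 - h + 1))  # channels c+1 .. f+1-h
    return []                                     # enough copies seen

# With f = 3: a processor that saw the hop-1 message only on channel 1
# forwards on channels 2 and 3; one that saw it on channel 3 stays silent.
```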
If a node attempts to broadcast a message $m$ at time $T$, then at time $T+D$ either all correct nodes deliver $m$ or none of them delivers $m$ (atomicity). All messages delivered are delivered in the same order at each correct node (order). If the sending node is correct, then all correct nodes deliver $m$ at $T+D$ (termination).
Notes for Cristian, F. (1988), "Reaching agreement on processor group membership in synchronous distributed systems", 18th Int. Conf. on Fault-Tolerant Computing, Tokyo, Japan, June 1988.
$\delta=\eta+d+\eta$: maximum processor to processor datagram message delay
$\Delta=\eta+D+\eta$: maximum processor to processor atomic broadcast delay
$J=2\Delta$: maximum join delay.
Periodic broadcast membership protocol:
$D=\pi+\Delta+2\eta$: maximum failure detection delay.
$R>\pi-\Delta+2\eta$: obtained by requiring $D < R+J$, where $R$ is the maximum restart delay
Attendance list membership protocol:
$D=\pi+\gamma+3\eta+J$, where $\gamma=n\delta+\epsilon$, $D_k=2\pi+(k-1)\delta+\epsilon+J$
$R>\pi+\gamma+3\eta$, only $n$ messages are needed
Neighbor surveillance protocol:
At time $O$, each member sends a "neighbor" message timestamped $O$ to its successor; each member expects to receive one by $E=O+\gamma'$ (the neighbor confirmation time), and a message is accepted if $O\geq E-\gamma'$. If a member detects that the last "neighbor" message it has accepted had an origin $O < E-\gamma'$, then a failure has occurred.
$D=\pi+\gamma'+3\eta+J$, where $\gamma'=\delta+\epsilon$; $D_k=(n-1)\pi+\gamma'+3\eta+J$ for $k > 1$, where $k$ is the number of failures.
Notes for C. L. Liu and J. W. Layland (1973), "Scheduling algorithms for multiprogramming in a hard-real-time environment", J. ACM, vol. 20, no. 1, pp. 46-61.
Theorem 1: A critical instant for any task occurs whenever the task is requested simultaneously with requests for all higher priority tasks.
Rate-monotonic priority assignment: Assign priority to tasks according to their request rates ($1/T_i$).
Suppose $T_1 < T_2$ and $\tau_1$ has higher priority; then feasibility requires $\lfloor T_2/T_1\rfloor C_1 + C_2 \leq T_2$. If $\tau_2$ has higher priority, then $C_1 + C_2\leq T_1$ is required, which implies the former condition, since $\lfloor T_2/T_1\rfloor C_1 + \lfloor T_2/T_1\rfloor C_2 \leq \lfloor T_2/T_1\rfloor T_1 \leq T_2$.
Theorem 2: If a feasible priority assignment exists for some task set, the rate-monotonic priority assignment is feasible for that task set.
Utilization factor: $U=\sum\limits_{i=1}^m(C_i/T_i)$, where $C_i$ is the computation time, $T_i$ the request period, and $m$ the number of tasks.
Least upper bound of $U$ for rate-monotonic feasibility: $U=m(2^{1/m}-1)$.
Theorem 7: For a given set of $m$ tasks, the deadline driven scheduling (ED) algorithm is feasible iff $(C_1/T_1)+(C_2/T_2)+...+(C_m/T_m)\leq 1$
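The rate-monotonic utilization bound (sufficient) and the deadline-driven condition of Theorem 7 (necessary and sufficient) can be checked with a short sketch (the task set is illustrative):

```python
# Schedulability checks from the notes: RM bound U <= m*(2^(1/m)-1) and
# the deadline-driven (ED/EDF) condition U <= 1.
def utilization(tasks):
    """tasks: list of (C_i, T_i) pairs."""
    return sum(c / t for c, t in tasks)

def rm_bound_ok(tasks):
    """Sufficient test: U within the least upper bound m*(2^(1/m)-1)."""
    m = len(tasks)
    return utilization(tasks) <= m * (2 ** (1 / m) - 1)

def edf_ok(tasks):
    """Theorem 7: deadline-driven scheduling is feasible iff U <= 1."""
    return utilization(tasks) <= 1

tasks = [(1, 4), (2, 6), (1, 10)]   # (C_i, T_i) pairs, illustrative
# U = 1/4 + 2/6 + 1/10 ~= 0.683; the RM bound for m = 3 is ~0.780,
# so this set passes both tests.
```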
Notes for D.L. Russell, "State Restoration in Systems of Communicating Processes", IEEE Transactions on Software Engineering, Vol. SE-6, No. 2, March 1980, pp. 183-194.
An asynchronous system is a finite collection of processes that interact with $SEND(m,x)$-$RECEIVE(m,x)$ primitives through a finite number of mailboxes (messagelists).
Consistent: for all mailboxes $m$, the sequence of messages received from $m$ is a prefix of the sequence of messages sent to $m$.
M-propagation: Propagation of state restoration due to sent messages that are received and later revoked.
R-propagation: Propagation of state restoration that occurs when a received message is replaced back into its mailbox and messages received later from that mailbox make the system state inconsistent.
Theorem 1: If for all mailboxes $m$ of a system, the messages in $m$ are received by no more than one process, the system is R-normal.
Theorem 2: A commutative system is R-normal.
Theorem 3: If for all mailboxes $m$ of a system, the hypothesis of either Theorem 1 or Theorem 2 holds, then the system is R-normal.
Theorem 4: Let $S$ be an R-normal system with $k$ processes whose system graph is acyclic. Then $S$ is domino-free.
$xPy$ iff any of following three cases are satisfied ($x,y$ are mailbox operations):
1) $x, y$ are in the same process and $x$ occurs before $y$.
2) $x, y$ are not in the same process, $x=SEND$ of some message $m_0$, and $y=RECEIVE$ of that same message $m_0$, i.e. $x \rightarrow m_0 \rightarrow y$.
3) $x, y$ are not in the same process, $x=RECEIVE$ message $m_1$ from mailbox $m$, $y=RECEIVE$ of a message $m_2$ from the same mailbox $m$, $m_1$ occurs before $m_2$ in the mailbox and the system is not R-normal.
$xQy$ if $xPy$ or $x$ occurs after $y$ in the same region of the recovery cache.
A mailbox operation $y$ is necessarily undone if $bP^{*}y$; it is unnecessarily undone if $bQ^{*}y$ but not $bP^{*}y$.
Lemma 1: Let $S$ be an asynchronous system and suppose that the command $RESTORE(r)$ is executed, where $r$ is a recovery point of process $p_1$. Then the earliest necessarily undone mailbox operation $x_i$ in process $p_i$ is as follows: a) $x_1$ is either a SEND or a RECEIVE operation, and is the first mailbox operation $b$ that occurs after $r$ in process $p_1$; b) $x_i$, $i\neq 1$, is a RECEIVE operation.
Lemma 2: Let $S$ be an asynchronous system. Let $z$ be a mailbox command that is necessarily undone, and let $w$ be a mailbox command such that $zQw$. Let $x_j$ be the earliest necessarily undone mailbox operation of process $p_j$. Then $w$ is unnecessarily undone iff: a) $w$ occurs before $x_j$; b) either $x_j=z$ or $x_j$ occurs before $z$; and c) $w$, $x_j$, and $z$ are all in the same cache region of process $p_j$.
Theorem 5: Let $S$ be an asynchronous system and suppose that a MARK command is executed just before every RECEIVE command. Then $S$ is domino-free and the number of unnecessarily undone operations is $D=0$.
An MRS process is a process in which MARK, RECEIVE, and SEND commands are performed in the order $(MARK;RECEIVE^*;SEND^*)^*$.
Theorem 6: Let $S$ be an R-normal, MRS system. Let the number of processes in the system be $k$ and let the number of RECEIVE commands that each process performs between any two MARK commands be less than or equal to $s$. Then $S$ is domino-free and $D\leq(k-1)(s-1)$.
Theorem 7: Let $S$ be an R-normal, MRS system. Let the number of RECEIVE commands in process $p_i$ that are executed between any two MARK commands be less than or equal to $s_i$. Then $S$ is domino-free and $D\leq \max\limits_{i}\sum\limits_{j\neq i}(s_j-1)=\sum\limits_j(s_j-1)-\min\limits_{j}(s_j-1)$.
Theorem 8: Let $S$ be an R-normal, MRS system. Then $S$ is domino-free.
Notes for A.K. Mok, "Task management Techniques for Enforcing ED Scheduling on Periodic Task Set", Proceedings of Fifth IEEE Workshop on Real-Time Software and Operating Systems(RTOS 1988), May 12-13, 1988, Washington D.C.
Use a heap-of-heaps (HOH) H to store ready tasks and a 2-3 tree T to store not-ready tasks. The leaves of the 2-3 tree are heaps, and each not-ready task is stored as a node in one of the leaves. All the tasks in the same leaf (heap) have the same ready time (the parameter $r$ in the task control block).
INSERT(x,H): Place x as far left as possible on the lowest level of H, starting a new level if necessary. If x has a higher priority (smaller integer) than its parent, exchange it with its parent. When comparing priority, use the priority of the root if the element under comparison is a heap, else use the priority of the simple node. Repeat the comparison as many times as necessary until either the root of H is reached or no exchange is needed.
DELETEMIN(H): Suppose the root of H is a heap with two or more nodes. Call this heap h. Remove and return the root of h and rearrange h. Call this new heap h’. Push h’ as far down the HOH H as it will go. When comparing priority, use the priority of the root if the element under comparison is a heap, else use the priority of the simple node.
If the root of H is a simple node, remove and return that node and rearrange H as follows: Take the rightmost element at the lowest level of H and put it at the root of H. Then push this element as far down H as possible.
SELECT(D): Perform a DELETEMIN on H. The task returned is to be executed next.
POST($\pi$,D): If $\pi$ has just been preempted, perform INSERT($\pi$,H). If $\pi$ has just completed execution, update its parameters by setting $d$ to the end of the next period and $r$ to the start of the next period. Locate the leaf in T which has the same ready time $r$ and insert $\pi$ into this heap, using the deadline $d$ as the key. If T does not contain a leaf with ready time equal to $\pi$'s, create a heap with $\pi$ as its only node and insert this heap into T.
WAKEUP(t,D): Locate a leaf in T with ready time = t. If one exists (call it h), delete h from T and insert h into H by performing INSERT(h,H).
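The SELECT/POST/WAKEUP flow can be sketched with simpler stand-ins: a plain binary heap in place of the HOH and a dict keyed by ready time in place of the 2-3 tree (so the asymptotic costs differ from the paper's, but the bookkeeping is the same):

```python
import heapq

class EDQueues:
    """Simplified ready / not-ready queues for EDF task management."""
    def __init__(self):
        self.ready = []       # min-heap of (deadline, task): stands in for H
        self.not_ready = {}   # ready time r -> min-heap of (deadline, task)

    def select(self):
        """SELECT: DELETEMIN on the ready queue; earliest deadline runs next."""
        deadline, task = heapq.heappop(self.ready)
        return task

    def post(self, task, d, r, preempted=False):
        """POST: a preempted task goes back to the ready queue; a completed
        task is queued as not ready until its next period starts at r,
        keyed by its new deadline d."""
        if preempted:
            heapq.heappush(self.ready, (d, task))
        else:
            heapq.heappush(self.not_ready.setdefault(r, []), (d, task))

    def wakeup(self, t):
        """WAKEUP: move every task whose ready time is t into the ready queue."""
        for d, task in self.not_ready.pop(t, []):
            heapq.heappush(self.ready, (d, task))

# Two tasks become ready at time 5; the one with the earlier deadline (8)
# is selected first.
q = EDQueues()
q.post("a", 10, 5)
q.post("b", 8, 5)
q.wakeup(5)
```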
Notes for L. Sha, R. Rajkumar and J. P. Lehoczky, "Priority Inheritance Protocols: an Approach to Real-Time Synchronization", IEEE Transactions on Computers, vol. 39, no. 9, September 1990, pp. 1175-1185.
Task blockings:
1) Direct blocking.
2) Priority-inheritance blocking. (The basic priority-inheritance protocol can still suffer deadlock and chained blocking.)
3) Priority-ceiling blocking: A task can enter its critical section only if its priority is higher than the maximum of the ceiling priorities of all the resources currently in use by other tasks. (The ceiling priority of a protected resource is the highest priority among the tasks that may use the resource.) (The priority ceiling protocol prevents deadlock and chained blocking.)
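The ceiling entry test can be sketched as a one-liner (my own simplification; here a larger number means higher priority):

```python
# Priority ceiling entry test: a task may enter a critical section only if
# its priority is strictly higher than every ceiling of the resources
# currently locked by OTHER tasks. Larger number = higher priority here.
def can_enter(task_priority, locked_ceilings):
    """locked_ceilings: ceiling priorities of resources locked by other
    tasks; an empty list means no resource is held, so entry is allowed."""
    return all(task_priority > c for c in locked_ceilings)

# Resource R1 has ceiling 3 (its highest-priority potential user has
# priority 3): a priority-5 task may enter, a priority-2 task is blocked.
```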
Acknowledgement:
This article is based on my notes from the course CS386C Dependable Computing Systems (Fall 2014), taught by Prof. A. K. Mok.