Stochastic OR

6.4 Server Failures

In Section 6.2 and Section 6.3 we assumed that servers are never interrupted while serving a job. However, this assumption is not always satisfied; for example, a machine may fail in the middle of processing a job. In this section, we develop a model1 There is not much point in combining the models of Section 6.2--Section 6.4 into one large model. For example, when a machine can be set up while a part is being repaired, there may not be time wasted on setup. Everything depends on specific strategies to combine outages. to calculate the influence on the mean waiting time of such preemptive outages, again based on Sakasegawa's formula for the \(G/G/1\) queue.

As in Section 6.3, we only have to derive expressions for the expectation and variance of the effective service time \(S\). The other components of Sakasegawa's formula are not affected.

We assume that a failure can occur at any time during the service time of a job. The repair times \(R_i \sim R\) have common mean \(\E R\) and finite second moment \(\E{R^{2}}\). Supposing that \(N\) interruptions occur during the net job service time \(S_0\), the effective service time becomes2 In an insurance context, if we interpret \(\{R_i\}\) as a set of claims, then \(\sum_{i=1}^N R_i\) is the total claim size of \(N\) claims. Likewise, in an inventory system, this sum is the total demand of \(N\) customers.

\begin{equation*} S= S_0 + S_N = S_0 + \sum_{i=1}^N R_i. \tag{6.4.1} \end{equation*}

In general, \(N\) is a random number, so we need to adjust for this in the computation of the mean and variance of \(S\).

It is common to assume that the time between two interruptions is memoryless with mean \(1/\lambda_f\), where \(\lambda_f\) is the failure rate. Defining the MTTF3 MTTF := mean time to fail. as \(m_{f} = 1/\lambda_f\), then the availability is given by

\begin{equation*} A:=\frac{m_f}{m_f + \E R}, \end{equation*}

from which follows easily that 4 Ex 6.4.7.

\begin{equation*} A=(1+\lambda_f \E R)^{-1}. \tag{6.4.2} \end{equation*}

Next, because by assumption the time between two failures is \(\sim \Exp{\lambda_{f}}\) so that \(N\sim \Pois{\lambda_{f} S_{0}}\) if \(S_{0}\) is given,

\begin{equation*} \E N = \E{\E{N|S_{0}}} = \E{\lambda_{f}S_{0}} = \lambda_{f} \E{S_{0}}. \tag{6.4.3} \end{equation*}

Therefore,

\begin{align*} \E S &= \E{S_0} + \E N \E R = \E{S_0} + \lambda_f \E{S_0} \E R \\ &= \E{S_0}(1+\lambda_f \E R) = \E{S_{0}}/A. \end{align*}

With this expression for the expected effective service time, the load becomes

\begin{equation*} \rho = \lambda \E S = \lambda \frac{\E{S_0}}A. \end{equation*}

Clearly, \(A\in (0,1]\), so the load increases due to failures.

With some more work, we obtain5 Ex 6.4.8. from Eq. (6.4.1),

\begin{align*} \V{S} &= \frac{\V{S_0}}{A^2} + \lambda_f \E{R^2} \E{S_0}. \tag{6.4.4} \end{align*}

If the repair times are exponentially distributed, it is possible to find a neat expression for the scv of the effective service time:6 Ex 6.4.10.

\begin{equation*} C_s^2 = C_0^2 + 2 A(1-A) \frac{\E{R}}{\E{S_0}}, \tag{6.4.5} \end{equation*}

where \(C_0^2\) is the scv of \(S_0\).

TF 6.4.1

The normal service time of a job, without interruptions, is given by \(S_0\). The durations of interruptions are given by \(R_i \sim R\). If \(N\) interruptions occur, the effective service time will then be

\begin{equation*} S= S_0 + \sum_{i=1}^N R_i. \end{equation*}

Then all steps in the computation below are correct:

\begin{align*} \E{\sum_{i=1}^N R_i} &= \E{ \sum_{n=0}^\infty \1{N=n}} \E{ \sum_{i=1}^n R_i } = \E N \E R \end{align*}
Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real

False. The two expectations in the middle cannot be split like this.

TF 6.4.2

Assume that server failures are memoryless and occur at a rate of \(6\) per hour. Moreover, assume that the expected time to fix a failure is \(18\) minutes. An investment can be made which will result in a failure rate of \(18\) per hour and an expected repair time of \(6\) minutes per failure. Claim: This is a good investment.

Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real

True. The service time is less variable.

TF 6.4.3

Claim: Server failures reduce the availability \(A\) but do not affect the effective service time \(\E S\).

Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real

False. Since \(\E S = \E{S_0}/A\), a lower availability increases the effective service time, and consequently the load \(\rho = \lambda \E S\) increases too.

TF 6.4.4

Claim: If the net service time \(S_0\) is deterministic, then the effective service time \(S = S_0 + \sum_{i=1}^N R_i\) is also deterministic.

Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real

False. Even with deterministic \(S_0\), the number of failures \(N\) and the repair times \(R_i\) are random, so \(S\) is random.

TF 6.4.5

Claim: In the model of this section, the scv of the effective service time satisfies \(C_s^2 \geq C_0^2\), where \(C_0^2\) is the scv of the net service time.

Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real

True. From Eq. (6.4.5), \(C_s^2 = C_0^2 + 2A(1-A)\E R/\E{S_0}\), and the second term is non-negative.

Ex 6.4.6

Suppose that we have a machine with memoryless failure behavior, with a mean time-to-fail of \(m_{f}=3\) hours. Regular service times are deterministic with an average of \(S_{0}=10\) minutes, jobs arrive as a Poisson process with a rate of \(\lambda=4\) per hour. Repair times are exponential with a mean duration of \(\E R = 30\) minutes. What is the average sojourn time?

Hint

Be sure to work in a consistent set of units, e.g., hours. It is easy to make mistakes.

Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real

Let us first check that \(\rho< 1\).

Python
labda = 4.0
ES0 = 10.0 / 60  # in hours
labda_f = 1.0 / 3
ER = 30.0 / 60  # in hours
A = 1.0 / (1 + labda_f * ER)
ES = ES0 / A
rho = labda * ES
print(f"{A=:.4f}, {ES=:.4f}, {rho=:.4f}")
Python
A=0.8571, ES=0.1944, rho=0.7778
Python
Ca2 = 1.0
C02 = 0.0  # deterministic service times
Ce2 = C02 + 2 * A * (1 - A) * ER / ES0
EW = (Ca2 + Ce2) / 2 * rho / (1 - rho) * ES
EJ = EW + ES
print(f"{Ca2=}, {EW=:.4f}, {EJ=:.4f}")
Python
Ca2=1.0, EW=0.5903, EJ=0.7847
Ex 6.4.7

Derive Eq. (6.4.2).

Hint

Observe that \(m_f = 1/\lambda_f\) and \(m_r = \E R\).

Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real

The time to fail is the time between two interruptions. By assumption, the failure times are \(\Exp{\lambda_f}\), hence \(m_f = 1/\lambda_f\). The expected duration of an interruption is \(\E R\). With this

\begin{equation*} A=\frac{m_f}{m_f + \E R}=\frac{1/\lambda_f }{1/\lambda_f + \E R} = (1+\lambda_{f} \E R)^{-1}. \end{equation*}
Ex 6.4.8

Derive Eq. (6.4.4) using Adam and Eve's rule.

Hint

Condition on \(S_0\) and \(N\).

Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real

When doing the calculations by hand, I worked from bottom to top. However, for the solutions, the current sequence is perhaps easier to understand.

\begin{align*} \E{S|S_0, N} &=\E{S_0 + \sum_{i=1}^N R \Big|S_0, N} = S_0 + N \E R, \text{ since \(\{R_i\}\) are iid},\\ \E{S|S_0} &= \E{\E{S|S_{0}, N}|S_0} = \E{S_0 + N \E R |S_{0}} = S_0 + \lambda_f S_0 \E R = S_{0}/A\\ \E{S} &= \E{\E{S|S_{0}}} = \E{S_0/A} = \E{S_0}/A.\\ \V{S|S_0, N} &= \V{S_0 + \sum_i^N R_i| S_{0}, N} = N \V R, \text{ as } \V{S_0|S_0} = 0,\\ \V{S|S_0} &= \E{\V{S|S_0, N} |S_0} + \V{\E{S|S_{0}N}|S_{0}} = \lambda_f S_0 \V R + \V{S_{0} + N \E R|S_0}\\ &= \lambda_f S_0\V R + (\E R)^{2} \V{N|S_0} = \lambda_f S_0\V R + (\E R)^{2} \lambda_{f} S_{0} = \lambda_{f} \E{R^{2}} S_{0}\\ \V S &= \E{\V{S|S_0}} + \V{\E{S|S_0}} = \lambda_f \E{R^2} \E{S_0} + \V{S_{0}}/A^{2}. \end{align*}
Ex 6.4.9

Show that

\begin{equation*} C_s^2 = \frac{\V{S}}{(\E S)^2} = C_0^2 + \frac{\lambda_f \E{R^2} A^2}{\E{S_0}}, \end{equation*}
Hint

Just realize that \(\E{S} = \E{S_0}/A\), and use the above.

Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real
\begin{align*} C_s^2 &= \frac{\V{S}}{(\E S)^2} =\frac{V(S) A^2}{(\E{S_0})^2} =\frac{\E{S_0^2} + \lambda_f \E{R^2} \E{S_0}A^2 -(\E{S_0})^2}{(\E{S_0})^2} \\ &=\frac{\E{S_0^2} -(\E{S_0})^2}{(\E{S_0})^2} + \frac{\lambda_f \E{R^2} \E{S_0}A^2}{(\E{S_0})^2} =C_0^2 + \frac{\lambda_f \E{R^2}A^2}{\E{S_0}}. \end{align*}
Ex 6.4.10

Assuming that the repair times \(R\) are exponentially distributed, show Eq. (6.4.5).

Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real

When repair times are exponentially distributed with mean \(\E{R}\), we have \(\E{R^2}=2(\E R)^2\). Since \(A=1/(1+\lambda_f \E R)\),

\begin{equation*} \begin{split} \lambda_f \E{R^2} A^2 &= 2\lambda_f (\E R)^2 A^2 = 2 \frac{\lambda_f \E R }{1+\lambda_f \E R} A \E R \\ &= 2 \left(1-\frac{1}{1+\lambda_f \E R}\right) A \E R = 2(1-A)A \E R. \end{split} \end{equation*}
Ex 6.4.11

How is the renewal reward theorem used in Eq. (6.4.2) to find an expression for the availability \(A\)?

Hint

Define suitable moments \(T_{1} < T_{2} < \ldots\) to inspect the system, and try to find a suitable function \(R(t)\) that captures what we want to measure.

Solution
Did you actually try? Maybe see the 'hints' above!:
Solution, for real

Take \(T_{k}\) as the end of the \(k\)th repair, and \(R(t) = \int_0^{t} u(s) \d s\), where \(u(s) = 1\) if the server is up (operational) and 0 if the server is down (broken). Thus, \(R(t)\) is the total amount of time the system has been up during \([0,t]\), and \(R(t)/t\) becomes the fraction of uptime (which is the availability) in the limit. For \(Z_{k}\), this is \(Z_k = R(T_k)-R(T_{k-1})\), which is the uptime in \([T_{k-1}, T_{k}]\). Therefore, in the limit, \(Z\) is the average uptime between two failures, hence \(Z \sim m_{f}\). Finally, the time between two inspections consists of an uptime and a breakdown. Hence, the rate at which such inspection epochs occur is \(1/(m_{f}+\E R)\).