Reliability of IEC 61850 based substation communication network architecture considering quality of repairs and common cause failures

Mission-critical IEC 61850 system architectures are designed to tolerate hardware failures to achieve the highest reliability performance. Hence, multi-channel systems are used in such architectures within industrial facilities to isolate machinery when there are process abnormalities. Inevitably, multi-channel systems introduce Common Cause Failure (CCF) since the subsystems can rarely be independent. This paper integrates CCF into the Markov reliability model to enhance the model's flexibility in investigating the reliability performance of a synchronous generator intra-bay SCN architecture, considering the quality of repairs and CCF. The Markov process enables integration of the impact of CCF factors on system performance. The case study results indicate that CCF, coupled with imperfect repairs, significantly reduces system reliability performance. High sensitivity is observed at low levels of CCF, whereas the highest level of impact occurs when the system diagnostic coverage is 99% based on ISO 13849-1, and the impact reduces as the diagnostic coverage level reduces. Therefore, it is concluded that the severity of CCF depends more on the system diagnostic coverage level than on the repair efficiency, although both factors impact the overall system performance. Hence, CCF should be considered in determining the reliability performance of mission-critical communication networks in power distribution centres.


Introduction
Digitalisation of substations is increasing as industries have increased confidence in applying Substation Communication Networks (SCNs) for process automation. The system offers many advantages such as easy system diagnostics, reduced copper wire, less installation time, increased monitoring, and simplifying the process of effecting or implementing system design changes [1,2]. IEC 61850 is the latest standard for SCNs that enables peer-to-peer communication between substation devices, allowing a faster communication platform between Intelligent Electronic Devices (IEDs) to share critical interlocking and protection messages [3]. Moreover, the IEC 61850 standard also supports bay distributed functions. This ensures high reliability, because the loss of communication resulting from a switch failure at the station level does not render the intra-bay schemes inoperable. However, the industry is still cautious about the reliability of IEC 61850-based SCNs for the execution of mission-critical functions in power distribution centres of industrial facilities in cases of process abnormality [3,4].
The ability of IEC 61850-based SCNs to enable substation devices to share information is highly preferred compared to legacy communication protocols that only allow master-slave communication configurations. These do not support the peer-to-peer communication required for distributed mission-critical signal exchange [1,5]. The standard also addresses the challenges resulting from multiple substation communication protocols, including proprietary protocols that make the integration of substation devices even more challenging [1,2]. The reliability of IEC 61850-based SCN architectures has been explored at both component and system levels using many approaches based on combinatorial analysis methods to investigate composite reliability of the system, such as reliability block diagrams and failure mode effect analysis in the form of a state-space transition approach. However, these approaches fall short when it comes to establishing some of the requirements of the safety-related standard IEC 61508 for Electrical, Electronic and Programmable Electronic (E/E/PE) devices. Complete digitalisation of SCNs in industrial facilities, including power utilities, requires SCNs to interface to IEC 61508-based safety-related systems for exchanging mission-critical messages to ensure a safe and reliable power generation process [6][7][8].
Mission-critical IEC 61850 systems are designed to tolerate hardware failures to achieve the highest reliability performance, which is the prerequisite of the IEC 60870-4 standard [2,8,9]. Hence, multi-channel systems are used in mission-critical systems within industrial facilities to isolate machinery in the event of process abnormalities. These systems offer higher reliability than single-channel systems when the failures of their channels are independent. However, multi-channel systems introduce Common Cause Failure (CCF) since the subsystems are rarely independent. CCF factors reduce subsystem independence in multi-channel architectures [10][11][12][13][14], and therefore, incorporating CCF in reliability models is essential to ensure that meaningful and realistic results are obtained [11,15]. A CCF is defined as a single point of failure that simultaneously causes subsystems of the system to become non-functional. The failure could be caused by one or more components failing within a specified time, resulting in the whole system becoming inoperable [11,12,16].
Even though dependent failures are primarily due to CCF and cascading failures, both types of failures are modelled as CCF in the literature [14][15][16][17][18]. Hence, dependent failures occur as a result of common stressors that affect multiple subsystems or components within a system [10,11,16]. Common causes can result from root causes or coupling factors, where root causes are related to system design and engineering, manufacturing and installation, testing and commissioning, and operating and maintenance. Coupling factors, however, can be associated with the same physical location and design, the same hardware and/or software, or the same installation and maintenance teams [12,14,16,18,19]. Nevertheless, root causes are the main reason for component failures, whereas coupling factors make a component susceptible to the same root cause. Hence, mitigating root causes does not necessarily eliminate coupling factors, making the modelling of CCF complicated. Consequently, common modelling of CCF follows a fixed proportion estimation approach considering the subsystem overall failure rate as the probability of CCF occurrence. This does not require system-specific data of the CCF itself [16,20].
The consideration of CCF as hazards leading to system failure necessitates their careful evaluation in system reliability studies to ensure that the reliability performance of the system is not overstated, since these hazards tend to increase the joint system probability of failure. Neglecting them leads to inaccurate system reliability evaluation [12,13,17]. Explicit modelling and analysis of the impact of CCF on the reliability and availability of a system can be a challenging task when the failure probabilities due to CCF are used in the development of the system reliability models [16,20,21]. Hence, various reliability models have been developed to ease the quantification and modelling effort for CCF. The models share one main objective even though their approaches may differ. This objective is to quantify the level of both the dependent and independent factors [11,15,16,20]. The contributions of this research are as follows: (a) Integrating CCF in the Markov process reliability model of mission-critical applications considering the quality of repairs. (b) Analysing IEC 61850-based SCN architecture reliability performance considering the quality of repairs for executing mission-critical functions. (c) Investigating the SCN architecture's responsiveness to increasing CCF levels based on the sensitivity and elasticity of mean state transitions.
The remainder of the paper is as follows. Section 2 presents a critical review of IEC 61850 SCN architecture reliability studies. Section 3 provides an overview of synchronous generator protection system architecture and the study basis. The β-factor model is presented in Sect. 4. The modelling of CCF in systems with imperfect repairs and limited diagnostic coverage is presented in Sect. 5 based on a Markov process, while Sect. 6 discusses system reliability considering CCF based on mean system state transitions using the absorbing Markov Chain process and matrix calculus. Case study results and discussions are presented in Sect. 7, and the findings and conclusions are highlighted and discussed in Sect. 8.

Reliability and availability performance studies
In [22], the transmission performance of different data streams is investigated using the Optimised Network Engineering Tool (OPNET) with a focus on the network architecture, while other investigations using OPNET have focused on the end-to-end delays of the messages on the network [23][24][25]. A comparative study of IEC 61850 editions I and II, to highlight the reliability enhancements in the edition II standard based on the Parallel Redundancy Protocol (PRP) and the High-availability Seamless Redundancy (HSR) protocol, is presented in [26]. PRP and HSR are considered deterministic because of their zero switchover time in a link failure case [27]. The accuracy of frame detection and discarding is presented and discussed in depth in [28]. Even though the architectural analysis presented in [26] is comprehensive, it does not address the quality of repairs and the associated CCFs. In [29], the application of IEC 61850-based Remote Terminal Units (RTUs) to integrate legacy devices in Substation Automation Systems (SAS) is demonstrated, but no reliability assessment is presented even though it claims that the selected architecture is reliable. Security issues concerning IEC 61850 SCNs based on IEC 62351-7 for network and system management are addressed in [30]. These issues are critical for the overall dependability of the SCNs. Strategies and methods of improving IEC 61850-based SCNs are addressed in [31], which highlights cost as the main hindrance to employing fully redundant systems, while it agrees that PRP and HSR offer high reliability and effectiveness, with HSR being more affordable than PRP. However, the quality of repairs and the CCF impact associated with architecture complexity are not addressed. Integration of circuit measurements using Conventional Instrument Transformers (CIT) and Non-Conventional Instrument Transformers (NCIT) in SASs is addressed in [32].
The reliability of the two architectures is evaluated using the Reliability Block Diagram (RBD) method, and it is concluded that NCITs offer higher reliability than CITs considering PRP and HSR protocol architectures [33,34]. In [35], the reliability performance of the star, ring, star-ring and redundant ring architectures is comprehensively investigated employing the RBD method, and the advantages and disadvantages of the architectures, as well as their communication efficiency using OPNET, are summarised. It states that, while the mathematical analysis resulting from using RBD enables detailed analysis of the network reliability performance, its drawback lies in its failure to consider the quality of repairs [2,8]. Another IEC 61850-based SCN architecture analysis using the RBD method is presented in [36], which does not consider the quality of repairs associated with the architecture in the case where a device failure occurs. Moreover, the reliability of the discussed architectures assumes zero network switchover time when network links fail even though the Rapid Spanning Tree Protocol (RSTP) is applied. This is impossible to achieve. In [37], an algorithm used to minimise traffic congestion in HSR is presented and discussed, though it does not discuss the impact of the quality of repairs. Reference [38] employs Monte Carlo Simulation to investigate the reliability of different IEC 61850-based SCN architectures.
Although the method is flexible in evaluating various impacts of failures and repairs, it only considers the reliability and availability of the SCN architectures assuming that the failure rate is non-constant and follows the Weibull distribution, without the impact of repair quality. In [7], RBD is adopted to analyse system reliability at the bay level while a state-space approach is used to construct a transition probability matrix. The state transition matrix is similar to the Markov transition probability matrix, but their similarity is not discussed. In addition, repair quality is not considered and it assumes that all repairs are fully implemented. Therefore, although the studies present comprehensive research concerning the reliability of IEC 61850-based SCN architectures, the quality of repairs (viz. imperfect repairs and diagnostic coverage of the system) is not considered in determining the reliability of SCN architectures.
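The RBD evaluations reviewed above rest on series and parallel combinations of block reliabilities. The following is a minimal sketch of that combinatorial calculation for a 'one-out-of-two' scheme; the component names and failure rates are hypothetical values chosen for illustration, and constant failure rates (exponential law) are assumed.

```python
import math

def exp_reliability(failure_rate, t):
    """Reliability of one block with a constant failure rate (exponential law)."""
    return math.exp(-failure_rate * t)

def series(reliabilities):
    """All blocks must work: multiply the block reliabilities."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result

def parallel(reliabilities):
    """At least one block must work: 1 minus the product of unreliabilities."""
    unreliability = 1.0
    for r in reliabilities:
        unreliability *= (1.0 - r)
    return 1.0 - unreliability

# Hypothetical channel (assumed values, per hour): merging unit, switch, IED in series.
LAMBDA_MU, LAMBDA_SWITCH, LAMBDA_IED = 2e-6, 1e-6, 3e-6
T_HOURS = 8760.0  # one year of operation

channel = series([exp_reliability(lam, T_HOURS)
                  for lam in (LAMBDA_MU, LAMBDA_SWITCH, LAMBDA_IED)])
# Two identical channels in parallel, assumed fully independent (no CCF, no repairs).
scheme_1oo2 = parallel([channel, channel])

print(f"single channel: {channel:.6f}, one-out-of-two: {scheme_1oo2:.6f}")
```

The sketch makes the limitation noted in the review visible: nothing in the series/parallel algebra represents repair quality or dependence between the two channels.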

Advanced IEC 61850 SCN architecture reliability studies
Reference [3] investigates the application of IEC 61850 SCNs in mission-critical safety-related systems using the Markov process. Results demonstrate that IEC 61850 can be considered for executing safety-related missions, whereas in [39], the performance of various SCN architectures is investigated using the Markov process, and it concludes that the performance is acceptable and economical. Reference [4] investigates IEC 61850-based SCNs for executing safety-related mission-critical commands based on the IEC 61508, which is the standard for safety-related systems, and concludes that the IEC 61850 standard can be considered for executing safety-related functional requirements. In addition, the research presented in [5] reveals that the IEC 61850 standard meets all the qualitative dependability requirements of the IEC 61508 as prescribed in IEC 61784-3. The impact of quality of repairs on the performance of SCN architectures and the basis for parameter optimisation are investigated in [40,41], whereas the responsiveness of the architectures' mean time to failure based on the mean system state transitions is investigated in [42,43]. However, CCF impact is not considered in these studies.

Suitability and flexibility of evaluation methods
The reliability performance of a mission-critical system needs to be modelled with high accuracy to ensure its performance. This cannot be achieved by combinatorial analysis methods [2,8]. The review in [44] states that the Markov process, Petri Nets and Monte Carlo Simulation methods can all be considered for investigating the reliability of a mission-critical system. Even though all three methods offer high accuracy, their flexibility, complexity and ease of implementation in modelling system reliability must also be considered. Petri Nets offer both state and transition modelling using places and arcs [45,46]. However, the method does not consider time and requires further translation into stochastic Petri Nets to simulate discrete systems. In contrast, the Markov process can model both discrete and continuous times naturally. Moreover, information on integrating Petri Nets into applications is still insufficient, while the Markov process is commonly used to investigate the reliability of safety-related systems [44,45,47].
In contrast to the Markov process, the Monte Carlo Simulation method can model various individual parameter failure distributions by sampling multiple parameter values for computation, making it more flexible than the Markov process. Nevertheless, this flexibility is not needed during a system's useful life, where only the exponential distribution is considered for E/E/PE systems. In addition, the Markov process offers deeper insight into system dynamics through its transition probability matrix, which supports various theoretical concepts for investigating behavioural characteristics, including the transient and asymptotic system response to system parameter changes [41][42][43]. The seamless transformation of the transition probability diagram into a transition probability matrix allows the integration of varied system parameters, enabling a holistic approach in studying the interaction of a system's subsystems, its environment, and human intervention through Systems Thinking [8,45,48,49]. In addition, unlike Monte Carlo Simulation, where a high number of simulations is required to obtain statistically meaningful results, the Markov process uses mathematical analysis of the transition probability matrix based on dynamical system studies and calculus methods [41][42][43]. Hence, the Markov process is most suitable for studying the reliability of mission-critical safety-related systems during their useful life because of its flexibility and accuracy, while also being simpler to implement than Petri Nets and Monte Carlo Simulation methods.
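The contrast between sampling and matrix analysis can be illustrated with a deliberately small example. The sketch below uses a hypothetical two-state absorbing chain (not one of the architectures studied here) and compares a Monte Carlo estimate of the mean number of steps to failure with the analytic value obtained from the transition matrix.

```python
import random

def simulate_steps_to_absorption(P, start, absorbing, rng):
    """Walk the chain from `start` until an absorbing state is hit; return the step count."""
    state, steps = start, 0
    while state not in absorbing:
        # Draw the next state according to the current row of P.
        state = rng.choices(range(len(P)), weights=P[state])[0]
        steps += 1
    return steps

# Hypothetical chain (assumed values): state 0 is transient, state 1 is absorbing.
P = [[0.9, 0.1],
     [0.0, 1.0]]

rng = random.Random(42)  # fixed seed so the run is repeatable
runs = 20000
mc_mean = sum(simulate_steps_to_absorption(P, 0, {1}, rng) for _ in range(runs)) / runs

# Analytic counterpart: with a single transient state, N = (1 - Q)^-1 and Q = P[0][0].
analytic_mean = 1.0 / (1.0 - P[0][0])

print(f"Monte Carlo estimate: {mc_mean:.3f}, analytic value: {analytic_mean:.3f}")
```

The analytic value needs one arithmetic operation, whereas the simulated estimate only approaches it as the number of runs grows.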

Overview of synchronous generator protection system SCN architecture and study basis
A simplified single line diagram of a synchronous generator with a 'one-out-of-two' IEC 61850-based protection scheme is presented. The scheme channels are based on star-configured SCN architectures, where Merging Units (MUs) are employed at the process bus to interface the Conventional Instrument Transformer (CIT) measurements to the respective scheme channels. Although it is common for the scheme to cover the auxiliary and generator step-up transformers, this paper focuses on the generator only because their SCN architecture concepts are similar. Figure 1 depicts the configuration of the SCN architectures on the generator (see also Table 1). The RBD of the protection scheme architecture is depicted in Fig. 2, and considers the independence of the individual scheme channels. In order to incorporate the impact of quality of repairs, the scheme in Fig. 2 is remodelled using a Markov process, as depicted in Fig. 3 [8,40]. As shown, λ, µ, $r_{eff}$, $e_{dc}$ and β represent the system failure rate, repair rate, repair efficiency, diagnostic coverage and common cause failure factor, respectively. State S-1 represents the fully functional state of the protection scheme, and states S-2 and S-3 represent a condition where only one of the scheme channels is available, whereas state S-4 represents a complete scheme failure. Consequently, the sum of states S-1, S-2 and S-3 probabilities is the system availability probability [40]. The integration of CCF impact resulting from the scheme location, engineering, design, manufacturing, installation and testing, commissioning and operating, and maintenance is presented in the following section based on the beta factor model.

The beta-factor model
The β-factor model is the most preferred and commonly used parametric method of evaluating the impact of CCF in 'one-out-of-two' system configurations [10,11,16]. The model is also presented and discussed in the IEC 61508 standard as one of the recommended methods of determining the effect of CCF in multi-channel systems. Modelling of CCF aims to determine their effect on system reliability and availability performance and enable the development of strategies against their impact [16,21]. Parametric models can be classified into shock and non-shock models, where shock models incorporate CCF basic mechanisms, while non-shock models are based only on the failure probabilities of CCFs. The β-factor model is based on an historical time to failure that is broadly applied. However, it is simplified since it does not explicitly account for individual sub-factors [50]. Nevertheless, considering that only the level of CCF is needed to determine the impact of common causes on the system reliability and that the channels under consideration are identical, the β-factor model can be used to model CCF in 'one-out-of-two' system configurations because it is simple to comprehend and apply. Also, it reduces the effort needed to analyse the results [11,15,16]. As a single parameter model, the β-factor model assumes that a constant fraction of the system, subsystem or component failure rate can be attributed to the failure probability of the CCF [15,16]. Thus, the total system failure rate $\lambda_T$ is given by

$$\lambda_T = \lambda_{CCF} + \lambda_{IND} \qquad (1)$$

where $\lambda_{CCF}$ represents the failure rate due to CCF while $\lambda_{IND}$ represents the failure rate due to independent components [20], which are given respectively by (2) and (3). The estimation of the β-factor is based on system diversity or properties, as well as the architecture [21]. Figure 4 depicts an RBD model of a 'one-out-of-two' multichannel system comprising subsystems A and B, where $\lambda_A$ and $\lambda_B$ are their respective failure rates.
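For orientation, the single-parameter split described above is commonly written in the textbook form below. This is a sketch of the standard β-factor relations and is assumed, not confirmed, to correspond to (1)-(3); the numerical values in the comments are purely illustrative.

```latex
\begin{align}
\lambda_T     &= \lambda_{CCF} + \lambda_{IND}, \\
\lambda_{CCF} &= \beta\,\lambda_T, \\
\lambda_{IND} &= (1-\beta)\,\lambda_T .
\end{align}
% Illustrative values (assumed): \lambda_T = 1\times10^{-5}\,\mathrm{h}^{-1}, \beta = 0.1
% give \lambda_{CCF} = 1\times10^{-6}\,\mathrm{h}^{-1} and \lambda_{IND} = 9\times10^{-6}\,\mathrm{h}^{-1}.
```

With β = 0 the common cause contribution vanishes and the channels are treated as fully independent; with β = 1 every failure is a common cause failure.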
Notably, the failure of any component represented by the failure rate function (2) causes the overall mission to fail. Hence, the RBD model offers an effortless comprehension of the β-factor model application. The model of Fig. 4 is redesigned using the Markov process introduced in Sect. 3 and described in [8,41,43], to enable the integration of CCF and imperfect repairs into the reliability model [2,8,49,54,55]. Figure 5 depicts the 'one-out-of-two' Markov state transition diagram model integrating the β-factor [21]. It is assumed that the CCF rate function $f(\lambda_A, \lambda_B)$ given by (2) is an averaging function of the two subsystems' failure rates, such that the CCF rate is the fraction of the CCF function value determined by the β-factor. In comparison to the model presented in Sect. 3, the model depicted in Fig. 5 shows that a system state transition from state S-1 to S-4 is possible due to the presence of CCFs, of which the failure rate is given by (2). The complete state transition probabilities of the 'one-out-of-two' system model depicted in Fig. 5 are given by the state transition matrix in (4). The Markov state transition β-factor model and its associated state transition matrix are used to enhance the 'one-out-of-two' Markov diagram state model depicted in Fig. 3, to investigate the impact of CCF on the system reliability performance considering imperfect repair factors. The integration of the CCF effect on the 'one-out-of-two' model with imperfect repairs and limited system diagnostic coverage is presented in Sect. 5.

Modelling imperfect repairs and CCFs
The 'one-out-of-two' system model presented in Sect. 3 is enhanced by incorporating CCF using the β-factor model described by (4) for investigating the impact of imperfect repairs at different CCF levels. Figure 6 depicts the Markov 'one-out-of-two' system transition probability diagram with imperfect repairs and CCF [8,45,56]. The associated transition matrix of the model depicted in the transition diagram of Fig. 6 is given by (5). Equation (5) enables the investigation of system reliability performance by observing the number of mean system state transitions at various levels of CCFs, depending on the selected value of the β parameter [8,21]. The model's flexibility to incorporate various factors allows the effect of the CCF factors on system reliability performance to be determined at different levels of imperfect repairs (viz. quality of repairs as discussed in [8]). Henceforth, the subsystems are assumed not to be entirely independent. This is to improve the accuracy of the reliability performance evaluation results, except in exceptional cases where β is set to zero to represent the non-existence of CCF in the system [21].
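One way to picture such a transition matrix is sketched below. It is an illustrative construction, not the matrix in (5): the exit rates of each state are assumptions (an averaged CCF rate out of S-1, an effective repair rate µ·e_dc·r_eff back to S-1, and the remaining channel's failure rate plus the undiagnosed share µ(1 − e_dc) towards S-4), and each row is normalised so that it describes the next state visited. All numerical values are hypothetical.

```python
def jump_chain(lam_a, lam_b, mu_a, mu_b, e_dc, r_eff, beta):
    """Return a 4x4 transition probability matrix for states S-1..S-4 (assumed structure)."""
    lam_ccf = beta * (lam_a + lam_b) / 2.0          # averaged CCF rate (assumption)
    rates = [
        # S-1: independent failure of A, independent failure of B, or CCF
        [0.0, (1.0 - beta) * lam_a, (1.0 - beta) * lam_b, lam_ccf],
        # S-2 (A down, assumed): effective repair, or loss of B / undiagnosed fault
        [mu_a * e_dc * r_eff, 0.0, 0.0, lam_b + mu_a * (1.0 - e_dc)],
        # S-3 (B down, assumed): mirror image of S-2
        [mu_b * e_dc * r_eff, 0.0, 0.0, lam_a + mu_b * (1.0 - e_dc)],
    ]
    # Normalise each row of exit rates into next-state probabilities.
    P = [[r / sum(row) for r in row] for row in rates]
    P.append([0.0, 0.0, 0.0, 1.0])                  # S-4 is absorbing
    return P

# Hypothetical parameters: rates per hour, 90% coverage, 95% repair efficiency, beta = 0.1.
P = jump_chain(lam_a=1e-4, lam_b=1e-4, mu_a=1e-1, mu_b=1e-1,
               e_dc=0.90, r_eff=0.95, beta=0.1)
for name, row in zip(("S-1", "S-2", "S-3", "S-4"), P):
    print(name, [round(p, 4) for p in row])
```

Setting `beta=0.0` removes the direct S-1 to S-4 transition, which corresponds to the independent-channel case.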

Sensitivity and elasticity of system performance to common cause failures
The sensitivity of the system reliability performance to CCF can be determined by investigating the fundamental matrix's responsiveness to different CCF levels. Given the transition probability matrix P, the fundamental matrix N is given by [8,42,45,57]

$$N = (I - Q)^{-1} \qquad (6)$$

where I is an identity matrix of the same order as Q, and Q contains the transition probabilities among the transient system states [41,58,59]. It can be shown that the sensitivity and elasticity of the fundamental matrix are given by (7) and (8) using matrix calculus methods [58,60,61], where R is a vector of elements of interest and D(X) is a matrix whose diagonal entries are the elements of vector X.
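The fundamental matrix can be computed directly from the transient block Q. The sketch below does so in plain Python for a hypothetical 3×3 block covering S-1, S-2 and S-3; the numerical entries are assumed for illustration, and the row sums of N give the expected number of transitions before absorption from each starting state.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    eye = identity(n)
    aug = [list(map(float, row)) + eye[i] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-15:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fundamental_matrix(Q):
    """N = (I - Q)^-1 for the transient block Q of an absorbing Markov chain."""
    n = len(Q)
    eye = identity(n)
    return inverse([[eye[i][j] - Q[i][j] for j in range(n)] for i in range(n)])

# Hypothetical transient block (assumed values) for states S-1, S-2, S-3.
Q = [[0.0, 0.45, 0.45],
     [0.9, 0.0,  0.0],
     [0.9, 0.0,  0.0]]

N = fundamental_matrix(Q)
# Entry N[i][j] is the expected number of visits to state j when starting in state i.
mean_transitions = [sum(row) for row in N]
print([round(t, 3) for t in mean_transitions])
```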
The stochastic probability matrix P of the system depicted in Fig. 6 is given in (9) in its lower-level form, while the transient probability matrix Q of the system depicted in Fig. 6 is given in (10) based on P given in (9).
[The matrices P, Q and Qdn are not recoverable from the source. The surviving entries include $\frac{\mu_A e_{dcA} r_{effA}}{\mu_A e_{dcA} r_{effA} + (\lambda_B + \mu_A - \mu_A e_{dcA})}$ and the corresponding subsystem-B term with numerator $\mu_B e_{dcB} r_{effB}$.]

Equations (12) and (13) enable the system reliability performance evaluation by careful observation of the system sensitivity and elasticity to CCFs. The notation and basics of calculus techniques applied in this paper are discussed in [42,43,58].
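The sensitivity and elasticity of the fundamental matrix to β can be sketched with the standard matrix identity dN/dβ = N (dQ/dβ) N, with elasticities obtained by scaling each sensitivity by β/N_ij. The example below is a simplified illustration rather than the formulation in (7) and (8): it assumes identical subsystems so that S-2 and S-3 can be lumped into one transient state, an S-1 exit probability of 2(1 − β)/(2 − β) towards that lumped state, and a hypothetical repair-return probability `q_repair`.

```python
def inv2(M):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def q_matrix(beta, q_repair):
    """Transient block for [S-1, one channel down] (lumped, symmetric assumption)."""
    a = 2.0 * (1.0 - beta) / (2.0 - beta)   # S-1 -> one channel down
    return [[0.0, a], [q_repair, 0.0]]

def fundamental(beta, q_repair):
    Q = q_matrix(beta, q_repair)
    return inv2([[1.0 - Q[0][0], -Q[0][1]], [-Q[1][0], 1.0 - Q[1][1]]])

def sensitivity(beta, q_repair):
    """dN/dbeta = N * dQ/dbeta * N, with only the S-1 exit probability depending on beta."""
    N = fundamental(beta, q_repair)
    dQ = [[0.0, -2.0 / (2.0 - beta) ** 2], [0.0, 0.0]]
    return matmul(matmul(N, dQ), N)

def elasticity(beta, q_repair):
    """Proportional change of each N entry per proportional change in beta."""
    N = fundamental(beta, q_repair)
    S = sensitivity(beta, q_repair)
    return [[beta * S[i][j] / N[i][j] for j in range(2)] for i in range(2)]

BETA, Q_REPAIR = 0.1, 0.9   # hypothetical values
print("sensitivity:", sensitivity(BETA, Q_REPAIR))
print("elasticity: ", elasticity(BETA, Q_REPAIR))
```

Negative entries indicate that the expected number of visits, and hence the number of transitions before failure, falls as β increases.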

Case study results and discussions
This section presents the results and analysis of the impact of CCF on the reliability performance of the 'one-out-of-two' system configuration depicted in Fig. 6.
The impact of CCF is investigated for the three levels of diagnostic coverage presented in ISO 13849-1. Table 2 presents the different system diagnostic coverage levels [62][63][64][65].
The following assumptions are made to ease the analysis of the case study results, recognising that different subsystem repair efficiency and diagnostic coverage levels could also be simulated and analysed for a system with a partial failure that leaves either subsystem A or B unavailable.
(a) The two subsystems are of the same technology, hence they have the same diagnostic capability. (b) Identical resources support both subsystems such that equal repair efficiencies are applied to them. (c) The system is operational and without partial failures at the beginning of the simulation.
Even though the system is assumed to be operational and without partial failures at the beginning of the simulation, any system state can be selected as the system's initial state assuming a partial failure has occurred in either subsystem A or B. Figure 7 depicts the system transition probability heatmap at 90% diagnostic coverage and 95% level of repair efficiency. Selecting a level below 100% acknowledges that 100% repair efficiency is unlikely to be achieved. The CCF level β is considered at 10% to illustrate the system's characteristic behaviour.
In contrast to the system configuration discussed in Sect. 3, the system under consideration can transition into state S-2, S-3 or S-4, with equal probability of transitioning into either state S-2 or S-3 considering S-1 as the initial state. Thereafter, the system will transition back to state S-1 except if it has transitioned into state S-4, which is the system's failsafe recurrent state. The likelihood that the system moves to state S-4 is relatively low, at about 0.05. This condition implies that the system is likely to move between states S-1, S-2 and S-3 before moving to state S-4. However, the system can transition to state S-4 at any time if one or more of the dependent failures occur. Hence, the system performance analysis under consideration investigates the mean state transitions before failure as the state transitioning characteristics in its transient state.
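The reliability after a given number of transitions can be read off as the probability of not yet having reached S-4. The sketch below propagates the state distribution step by step from S-1; the transition matrix is a hypothetical example with a 0.05 probability of reaching S-4 from every transient state, chosen for illustration only.

```python
def step(dist, P):
    """One transition: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def reliability_curve(P, start, absorbing, n_steps):
    """Probability of still being in a transient state after 0..n_steps transitions."""
    dist = [0.0] * len(P)
    dist[start] = 1.0
    curve = []
    for _ in range(n_steps + 1):
        curve.append(1.0 - sum(dist[s] for s in absorbing))
        dist = step(dist, P)
    return curve

# Hypothetical matrix (assumed values); rows and columns ordered S-1, S-2, S-3, S-4.
P = [[0.0,  0.475, 0.475, 0.05],
     [0.95, 0.0,   0.0,   0.05],
     [0.95, 0.0,   0.0,   0.05],
     [0.0,  0.0,   0.0,   1.0]]

curve = reliability_curve(P, start=0, absorbing={3}, n_steps=50)
print([round(r, 4) for r in curve[:5]])
```

Because S-4 is absorbing, the curve can only decrease, and it decays faster when the probability of entering S-4 grows.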

High diagnostic coverage
The system diagnostic coverage level is assumed to be 99%, whereas its repair efficiency is 95%. Figure 8 depicts the reliability of the 'one-out-of-two' system shown in Fig. 6 for different levels of CCF represented by the β-factor. It can be observed from Fig. 8 that the system has the highest reliability performance level when the β-factor is zero, as a zero β-factor represents a condition where the subsystems A and B are assumed to be entirely independent of each other. Hence, a simultaneous failure of the two subsystems A and B is improbable. Nevertheless, the system reliability rapidly decreases with increasing CCF as the failure probability due to CCF increases, represented by the direct state transition from state S-1 to S-4.
The results also indicate that the reliability performance is sensitive to changes at low levels of the β-factor. Moreover, the change in the system probability performance curves can be precisely associated with different levels of mean state transitions, which in turn represent a change in system reliability level. Figure 9 depicts the reliability of the system when its subsystems have low repair efficiencies. The much-reduced level of repair effectiveness represents a high level of incomplete and/or incorrect repairs carried out on the system. The scenario's objective is to investigate the impact of CCF on system reliability performance when the quality of repairs is deficient. Hence, the repair efficiency of the individual subsystems is considered as 50% for simulation purposes.
It is noticeable that the impact of CCF is relatively low for changes of the β-factor compared with that in the previous scenario. The impact also reduces as the level of CCF increases, as was the case with 95% repair efficiency. As expected, the system reliability becomes zero at fewer time steps, as seen in Fig. 9 for the different levels of CCF represented by the β-factor. However, CCF appears to have a smaller impact on system reliability at low repair efficiency levels than when efficiency is high. The system behaviour can be attributed to the reduced repair rates of the subsystems, which reduce the likelihood of the system moving from states S-2 and S-3 back to S-1, whereas the likelihood of the system moving to state S-4 increases. Figure 10 depicts the mean state transitions at various levels of CCFs. It is notable that the mean system state transitions are highly sensitive to changes of the β-factor, particularly at low levels of β. This indicates that the presence of CCF significantly reduces the performance regardless of the CCF level. This behaviour is similar across the various repair efficiency levels.

Medium diagnostic coverage
The system diagnostic capability is assumed as 90%, whereas the repair efficiency remains unchanged at 95%. Figure 11 depicts the reliability of the considered system for different levels of CCF represented by the β-factors. Again, it is noticeable from Fig. 11 that the system maintains the highest reliability performance level when the β-factor is zero, as in the scenario when the coverage was 99%. As expected, the reliability decreases with increasing CCF level as the system failure probability increases. The reliability of the system becomes zero at much lower system state transitions when more system faults remain hidden, than at the high diagnostic coverage of 99%. Figure 12 depicts the system reliability when its subsystems have low repair efficiencies of 50%. It can be seen that the impact of CCF is relatively uniform for the levels of the β-factors, and the relative impact is less than the scenario with repair efficiency of 95%. The impact also reduces uniformly as the level of CCF increases, as was the case with repair efficiency of 95%.
As expected, the system reliability becomes zero at fewer time steps, as depicted in Fig. 12 for the different levels of CCF represented by the β-factor levels. However, CCF appears to have a smaller effect on system reliability at low repair efficiency levels than at the high level of 99%.
The system behaviour can be attributed to the reduction in the subsystem repair rates, which reduces the likelihood of the system moving from states S-2 and S-3 back to S-1, whereas the likelihood of the system moving to state S-4 increases. Figure 13 depicts the system mean state transitions at various CCF levels. As seen, the mean number of state transitions of the system is marginally sensitive to the changes of the β-factor level as expected. This observation is the same for the different levels of system repair efficiency as was the case with high coverage of 99% even though the number of transitions has significantly reduced, particularly at low levels of β.

Low diagnostic coverage
The system diagnostic coverage level is assumed to be 60% for this case study. Initially, the repair efficiency is 95%, as in the previous case studies. Figure 14 depicts the reliability of the system for different levels of CCF represented by the β-factor levels. It is noticeable again that the system has the highest reliability performance level when the β-factor is zero, as in the previous case studies with 99% and 90% diagnostic coverage levels. Contrary to the results obtained when the diagnostic coverages were at 99% and 90%, the system reliability only decreases marginally with increasing CCF.
Moreover, for β = 0 the reliability becomes zero after only 20 transitions, compared to 950 and 90 transitions when the system diagnostic coverage was 99% and 90%, respectively, because more system faults remain hidden. In addition, the system is characterised by low sensitivity to changes in β levels. Figure 15 depicts the system reliability when its subsystems have a low repair efficiency of 50%.
The impact of CCF is relatively lower for changes in the β-factor level than with 95% repair efficiency. Again, the impact also increases as the level of CCF increases. The system reliability becomes zero at fewer time steps, as depicted in Fig. 15 for the different levels of CCF represented by the β-factor levels. Moreover, the impact of CCF on system reliability appears to be proportionally the same at all repair efficiency levels. The system behaviour can be attributed to the reduction in the repair rates of the subsystems. This reduces the likelihood of the system moving from states S-2 and S-3 back to S-1, whereas the likelihood of the system moving to state S-4 increases. Figure 16 depicts the mean state transitions at various CCF levels, indicating that they are relatively insensitive to the changes of the β-factor levels.

Sensitivity of system reliability
This section presents the sensitivity and elasticity analysis results of the system performance considering mean transitions based on an absorbing Markov chain process and calculus inferences. The symbol Sxy represents transitions into state S-y when the system's initial state condition is S-x. Figure 17 depicts the system responsiveness to CCF based on sensitivity and elasticity when the coverage is 99% for β = 0.1 and β = 0.5. It can be observed in Fig. 17a that the mean transitions into state S-1 are the most sensitive at −139.7 when the CCF level is 10%. The negative sign indicates that an incremental change in the CCF level causes the system's mean state transitions to decrease, which implies that the system reliability performance decreases. Again, it is noticeable that state S-1 is the most sensitive when β = 0.5, as depicted in Fig. 17b. However, the magnitude of the state transition sensitivity is reduced to −7.6. Although state S-1 has the highest sensitivity, its elasticity is the lowest compared with transitions into S-2 and S-3. This observation is similar for the two CCF levels. Nevertheless, the results depicted in Fig. 17 indicate that the system reliability performance is most sensitive to low β-factor levels when the diagnostic coverage is high.

Medium diagnostic coverage
Figure 18 depicts the system responsiveness to CCF based on sensitivity and elasticity when the coverage is 90% for β = 0.1 and β = 0.5. It can be seen in Fig. 18a that the system mean state transitions into state S-1 are the most sensitive at −21.5 when β = 0.1. Again, the negative sign indicates that an incremental change in the CCF level causes the mean system state transitions to decrease, which implies that the system reliability performance decreases when the level of CCF increases. Similar to the previous scenario, it is noticeable in Fig. 18b that state S-1 remains sensitive when β = 0.5 even though the diagnostic coverage is reduced. However, the magnitude of the state transition sensitivity is reduced further to −4.9. The elasticity of state S-1 transitions is the lowest compared with transitions into S-2 and S-3. This observation is similar for the two CCF levels at β = 0.1 and β = 0.5. The results confirm that the system performance is most sensitive to low β-factor levels when the diagnostic coverage is medium.

Low diagnostic coverage
Figure 19 depicts the system responsiveness to CCF based on sensitivity and elasticity when the coverage is 60% for β = 0.1 and β = 0.5. The system mean state transitions into state S-1 are the most sensitive at −1.7 when β = 0.1, as depicted in Fig. 19a. Again, the system reliability performance decreases when the level of CCF increases. Similar to the two previous scenarios, it is noticeable in Fig. 19b that state S-1 remains sensitive when β = 0.5 even though the diagnostic coverage is reduced further to 60%. However, the magnitude of the state transition sensitivity is marginally reduced to −1.4. The elasticity of state S-1 transitions is consistently the lowest compared with transitions into S-2 and S-3. This observation is the same for the two CCF levels at β = 0.1 and β = 0.5. Again, the results confirm that the system performance is most sensitive to low β-factor levels.
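One common way to obtain sensitivity and elasticity of mean transitions in an absorbing Markov chain is through the fundamental matrix N = (I − Q)^(-1), whose entry (x, y) is the expected number of visits to S-y when starting in S-x. Differentiating gives dN/dβ = N (dQ/dβ) N, and the elasticity is (β / N) dN/dβ taken element by element. The sketch below applies these identities to a hypothetical transient matrix Q. The state structure, lam, mu and the placement of β, coverage and repair efficiency are assumptions for illustration and do not reproduce the values reported in Figs. 17 to 19.

```python
import numpy as np

def build_q(beta, coverage, repair_eff, lam=0.01, mu=0.5):
    """Hypothetical transient block over (S-1, S-2, S-3); S-4 is absorbing."""
    indep = (1.0 - beta) * lam          # independent single-channel failure
    ccf = beta * lam                    # common cause failure of both channels
    back = coverage * repair_eff * mu   # detected fault with an effective repair
    return np.array([
        [1.0 - ccf - 2.0 * indep, indep, indep],
        [back, 1.0 - back - lam, 0.0],
        [back, 0.0, 1.0 - back - lam],
    ])

def dq_dbeta(lam=0.01):
    """Analytic derivative of the assumed Q with respect to beta."""
    return np.array([
        [lam, -lam, -lam],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
    ])

def mean_transitions(q):
    """Fundamental matrix N = (I - Q)^-1 of the absorbing chain."""
    return np.linalg.inv(np.eye(q.shape[0]) - q)

def sensitivity_elasticity(beta, coverage, repair_eff, lam=0.01):
    """Return N, dN/dbeta and the element-wise elasticity (beta / N) * dN/dbeta.

    Valid for 0 < beta < 1 and coverage > 0, so that every entry of N is positive.
    """
    n = mean_transitions(build_q(beta, coverage, repair_eff, lam))
    sens = n @ dq_dbeta(lam) @ n
    elas = beta * sens / n
    return n, sens, elas

if __name__ == "__main__":
    for beta in (0.1, 0.5):
        _, sens, elas = sensitivity_elasticity(beta, coverage=0.99, repair_eff=0.95)
        # Row 0 corresponds to starting in S-1, i.e. S11, S12 and S13.
        print(beta, np.round(sens[0], 2), np.round(elas[0], 3))
```

A negative sensitivity means that the expected number of visits falls as β grows, while the elasticity rescales that change into a percentage response per percentage change in β, which makes states with very different visit counts comparable.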

Conclusions
The integration of the β-factor model into the Markov reliability model enhances the model flexibility in investigating various system cases, enabling the impact of CCF to be studied at different levels of imperfect repair (viz. repair efficiency and system diagnostic coverage). The Markov process provides a comprehensive method of evaluating the responsiveness and effectiveness of the system performance to incremental CCF levels based on sensitivity and elasticity analysis studies. The case study results indicate that the existence of CCF significantly reduces system reliability performance. The most significant impact on system reliability is observed at low levels of CCF, represented by small changes in the β-factor magnitude, whereas the highest level of impact is noticeable when the system diagnostic coverage is 99% based on ISO 13849-1. This impact reduces as the level of diagnostic coverage reduces. The characteristic impact of CCF is relatively similar for a given level of system diagnostic coverage and repair efficiency, as demonstrated by the case study results. Therefore, it is concluded that the severity of CCF depends more on the system diagnostic coverage level than on the repair efficiency, as evidenced in the sensitivity and elasticity studies, even though both factors affect the overall system performance. This system response is evident from the case study results, where the system sensitivity based on mean state transitions at a CCF of 10% is −281 while its elasticity is −2.74, assuming 99% system diagnostic coverage. Similar behaviour is observed when the diagnostic coverage is 90%, where the system sensitivity is −45.5 at 10% CCF while its elasticity is −1.11. A sensitivity of −4.5 and an elasticity of −0.33 are observed with 60% diagnostic coverage. Overall, the system sensitivity decreases by 84% when the diagnostic coverage is reduced from 99% to 90%, and by 90% when the diagnostic coverage is reduced from 90% to 60%.
The system response is similar when the CCF is 50%: its sensitivity decreases by 32% when the diagnostic coverage is reduced from 99% to 90%, and by 63% when the diagnostic coverage is reduced from 90% to 60%. The system elasticity indicates the effectiveness of managing the CCF level, as presented in the results.
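The percentage reductions quoted for 10% CCF follow directly from the reported system sensitivities. The short check below recomputes them from the values stated above. The helper name percent_reduction is introduced here only for illustration.

```python
def percent_reduction(before, after):
    """Relative drop in magnitude from `before` to `after`, in percent."""
    return 100.0 * (abs(before) - abs(after)) / abs(before)

# System sensitivities at 10% CCF, keyed by diagnostic coverage in percent,
# as reported in the conclusions.
sens_10 = {99: -281.0, 90: -45.5, 60: -4.5}

drop_99_to_90 = percent_reduction(sens_10[99], sens_10[90])
drop_90_to_60 = percent_reduction(sens_10[90], sens_10[60])

print(round(drop_99_to_90))  # 84
print(round(drop_90_to_60))  # 90
```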
Hence, the impact of CCF must be considered in developing reliability models of a mission-critical system to determine system performance accurately. Future research will consider diversifying the scheme channels to minimise CCF impact on the scheme reliability and employ a multiple beta factor model to determine the impact of the individual channels. The research will also consider the use of global sensitivity analysis methods. Future research will also focus on the generalisation of the findings to a KooN system.