Human reliability analysis in maintenance team of power transmission system protection

The demand for a reliable electrical energy supply increases continuously because of its vital role in our lives. However, events caused by various factors in the power grid can interrupt the energy supply. One of these factors is human error, and thus human reliability analysis is an important element in the industry. The first step is to identify the roots of human error, an area in which there has been limited research. In this paper, the potential and actual causes of human error in maintenance teams of power transmission system protection are identified and predicted within the framework of the human factors analysis and classification system (HFACS) method. Then, human error factors are ranked to help improve human reliability. The proposed method is implemented in the Fars Electricity Maintenance Company.


Introduction
Increase in electrical energy consumption requires more stable and reliable power systems, and any interruption or disturbance in the supply of modern sensitive loads may lead to high cost.
According to the annual reports from NERC [1] and the Iran Grid Management Company [2], about 70% of electrical outages are due to equipment failures or problems in power grids, and about 9 to 17% of the outages are rooted in human error. Various studies have been carried out to identify the cause of these interruptions. Most of these analyses attempt to find the technical roots of equipment failures and their solutions. However, less attention has been paid to the investigation of human error in the power transmission industry [3,4]. Surveys show that human error can affect the safety of personnel and equipment, as well as reduce the reliability of the network. It can also affect the income of electricity companies through loss of energy transmission and electricity market penalties. The impact of human error on the safety of personnel, in terms of health, psychological, and social integrity, is much more important than the technical aspects of failures and errors. Human reliability analysis (HRA) is therefore needed to reduce the causes of human error [5].
Research on HRA began in the 1950s. The probability of activities carried out correctly by a person over a given period under certain working conditions is called human reliability [6]. The first step in HRA is to identify the roots of human error, and many relevant studies have been carried out in different industries, especially those in nuclear, structure, aviation, and petroleum [7,8]. However, in the power transmission industry, despite its wide range, human factor studies and root identification have not been carried out comprehensively.
The relevant studies on power systems are mainly limited to analyzing and monitoring human error and its effect on the failure of power transmission systems. However, as far as we know, there is no comprehensive report studying the root causes of human error. In this paper, a method for identifying the potential and actual root causes of human error in power transmission system maintenance is proposed.
To control the human factors, it is necessary to properly recognize the potential errors. Various models (such as Technique for Human Error Assessment (THEA), Predictive Human Error Analysis (PHEA), etc.) have been used in recent years to identify and analyze human error [9,10]. Among these methods, the Human Factors Analysis and Classification System (HFACS) method follows a systematic procedure to find the possible causes of human error, e.g., decision error, and is suitable for analyzing human error in a power system blackout. The main advantage of this method is the division of various human error factors into a comprehensive framework of errors made by persons involved in maintenance operations (repair workers, supervisors, managers, and administration personnel). By analyzing past events and examining the conclusions of experts, HFACS can present a comprehensive framework of human error at the four levels of unsafe acts, precondition for unsafe acts, unsafe supervision, and organizational influence.

Background and related work
One of the main causes of events in most industries, including nuclear power plants, aviation, chemical industries, etc., is human error. These are unintended errors that could lead to sudden failure [11].
Colombia's blackout in 2007 left 41 million people without power for 4.5 h. This was caused by a human error during the maintenance of a protective device in a 230 kV substation. In [12], the roots of human error in this event were identified as deficiencies related to operator training, protection settings and coordination, protection and control schemes, and instructions for scheduling and performing maintenance. Also, according to studies conducted in [13], 32% of major blackouts in some parts of the world from 2011 to 2019 were due to human error or equipment violations. Therefore, human error is an important factor affecting the reliability of power systems. Reference [14] proposes an evaluation model to assess the reliability of a power system, and includes human error and protection system failure. The evaluation results show that to have a reliable system, it is necessary to pay attention to human factors in the power grid and to manage human error. According to the above discussions on human reliability and human error in power grids, studies can mainly be described through five different topics as follows:

Identifying and investigating the causes of human error
Reference [15] examines the perspectives of five short reports on human factors in electric utility dispatch control centers. According to the analysis, the common denominators of all the reports are stressors and stressful conditions. Reference [16] investigates human reliability during events that occurred in the Chinese power system, and the human error factors corresponding to each event are identified using the CREAM method. The fuzzy-clonal method is then applied to classify the identified factors and determine the worst factor. However, no solution is proposed to solve the problem. Various factors that affect human reliability in maintenance, such as environmental, organizational, and job factors, personal characteristics, etc., are introduced in [17], which also shows the extent to which human reliability is affected by changes in these factors. The research presented in [18] introduces "motivation" and "competence" as the most important human factors influencing the performance of power transmission maintenance personnel. In [19], fatigue, knowledge, experience, and time pressure are recognized as the most important human factors, while [20] shows that older operators' unwillingness to use personal safety instructions or equipment, due to over-reliance on their experience, increases error. In [21], human factors, including the complexity of human-machine interaction and conscious and unconscious human error, are considered as one of the five risk elements in the development of a new energy power system. It also shows that the complexity of human-machine interaction is more important than the other two factors, and suggests that employees fully follow operating rules and spend more time in training. In [22], a study on job stress in human resource management is conducted, and the results show that psychological factors are also very important and influential.

Quantitative calculation of human reliability
In [23], a suitable method for quantitative assessment of human reliability is presented. However, in the proposed method, organizational factors and inter-dependency between operators in power system switching operations are not considered [6,24]. To improve the method in [23], references [6,24] propose methods for measuring and analyzing human reliability in a power system, methods which take into account organizational factors and the interdependency between operators during a switching operation. The results indicate that the probability of human error is much closer to the actual situation recorded in statistics. The probability of human error is estimated in [25] by combining the two methods of a Success Likelihood Index Model and a Bayesian Network.

Consequences of human error
The studies in [3,26] verify and analyze the effects of human error on power system reliability and conclude that it is essential to consider human factors when determining maintenance policy. It is shown in [14] that the two power system reliability indicators, LOLP and EPNS, increase with human error.

Evaluating personnel performance
Nowadays, electricity company operators need to manage large volumes of data because of the sensitivity of electrical energy supply, and are required to solve the problems related to unexpected events in a power grid quickly. In this way the impact of these unexpected events on consumers can be minimized. Hence, in [27], a method is designed with the help of technical and economic indicators to evaluate the performance of distribution network operators. In the case study, it is shown that the impact of human error on the consequent interruption duration for priority consumers and revenue reduction is more severe than others. The study in [28] proposes a new method for evaluating the impact of dispatchers' excessive workload on human error. This method evaluates the dispatchers' workload from the four dimensions of information comprehension, speech output, action output, and attention. Results from the study on ten dispatchers reveal the probability of human error due to inappropriate workload.

Methods of reducing human error
Reference [19] proposes an approach to identify effective solutions for reducing human error in maintenance activities based on cost-benefit analysis at the Kenya power plant. The paper divides the causes of human error into 11 factors, the most important of which are the use of instructions, fatigue, transfer of knowledge and experience, and time pressure. One of the maintenance activities of power transmission lines is their inspection, which can cause fatigue to the human inspectors because of the long distances involved and sometimes impassable routes. Therefore, power line inspection software is presented in [29] to reduce human error. Experiments performed with the software show that inspectors' workload is significantly reduced, as is human error. One way to reduce the fatigue of maintenance personnel is to use digital substations, as remote testing eliminates the need to travel to the substation. In [30], digital substations, maintenance testing, and remote testing are defined. Reference [31] shows that the effectiveness of a human operator in automated systems depends on the characteristics of the workplace and the working environment. It proves that taking measures to improve the working environment, in addition to improving ergonomic indicators, is profitable for businesses. Awareness of the situation at the time of an error is one of the important factors in the security of the power system, because it allows the operator to make successful decisions in a timely manner and without errors, and to prevent cascading outages. In [32], the increase of situational awareness by fault location through fault passage indicators (FPI) has been investigated.
Although various studies have been carried out on human error and its effects, many issues have not been fully addressed. The main purpose of this paper is to clarify the following issues:
1) "Human error" is the consequence of circumstantial and situational factors that affect human performance, and is not itself the cause of events [4]. Therefore, it is necessary to identify these situational factors from all aspects, such as organization, personnel, workplace, etc., to reduce human error. Research has so far only identified the causes of human error to a limited extent.
2) In September 2011, a maintenance error and weaknesses in operation planning caused a power outage for more than six million households from Southern California and Arizona to northwestern Mexico [4]. According to the report in [12], the cause of two of the 14 major accidents in the world from 2003 to 2015 was human error by maintenance groups. Human factors in aviation maintenance have been studied extensively, while other industries have been slow to integrate human factors into their maintenance performance measurements [18]. The factors affecting power industry maintenance groups differ from those affecting power grid operators or maintenance groups in other industries for the following reasons: working in an HV electric environment; the existence of miscellaneous types of protection equipment from different manufacturers; the location of power transmission substations in different geographical and climatic zones; and working with RPMT at unusual times, such as at night or on weekends.
3) To improve quality, maintenance operations are always supervised by supervisor groups. Therefore, the performance of the supervisor groups can affect that of the PMET. However, previous studies have not examined such factors.
The power system of the Fars Electricity Maintenance Company (FEMC) has been selected as a case study.

Identification of high-risk teams
The activities and duties of all the maintenance workgroups that can lead to human error are recognized, and Fig. 2 shows the personnel chart of the FEMC maintenance teams. In the FEMC power grid, which has 15,000 km of transmission lines and 250 substations, more than 75 maintenance groups are involved in maintenance every day, and most of these groups have 3 personnel. Execution teams are responsible for maintenance in transmission substations and lines over 63 kV. Headquarters teams support the executive teams in scientific, financial, and administrative matters. In the headquarters group, two subgroups directly affect system protection, i.e., the protection relay setting team and the spare parts team. If the personnel in these two subgroups do not perform as required, unplanned interruptions of the power grid may be caused by incorrect setting of the protection relays or the purchase of poor-quality equipment. The other subgroups in the headquarters group can also indirectly affect the performance of the executive groups. For example, should the financial subgroup not provide adequate financial resources, it could cause personnel dissatisfaction and misconduct.
As shown in Fig. 3, human error was the primary cause of power grid outages of the Fars Regional Electricity Company due to maloperations in the protection sector over the period from 2012 to 2017. As is seen from Fig. 3, human error of the maintenance teams is divided into three categories, of which the error of the RPMT is the largest cause of power outages. Therefore, the RPMT is identified as the high-risk work team.

HRA method selection
There is no large labeled dataset available regarding the roots of human error [33]. Thus, it is crucial to develop a systematic method that can detect the main causes of human error and classify them by using small labeled datasets. Reference [4] studies the HRA methods and shows that the THERP and CREAM methods are the most common ones in power system application. However, these methods belong to the older generations of HRA [14], and are time consuming, and do not have a clear procedure for error detection [4].
In 1990, Reason presented a model for identifying human error in air accidents, but no corrective solutions were proposed [33]. Shappell and Wiegmann introduced a model called HFACS that was developed based on the Reason model to identify human error [34]. HFACS was argued by Dekker in 2002 to be one of the most powerful tools for examining different types of incidents [35]. This method is divided into four categories based on the structure: organizational influence, unsafe supervision, precondition for unsafe acts, and unsafe acts of the operator. Then, the HFACS model is used to summarize and categorize these roots for the following reasons: 1) Extensive analysis of human error considering the multiple causes of human failure [36]. 2) Considering a framework for identifying causes of supervision. 3) General terms and descriptors allowing the HFACS method to be used for a wide range of industries and activities. 4) Among the latest generation of HRA.

Collection and classification of data
The steps of identifying, collecting, and classifying the causes of human error in this paper have been done according to the framework of Fig. 4, for which more details will be given in Section 4.

Identify a technique for assessing
Since controlling and reducing the 60 causes of human error identified in Section 4 is difficult and time-consuming, these causes are ranked to prioritize the important and key problematic factors and to avoid wasting time and money.
The calculation of the probability of the occurrence of basic causes (roots) from the perspective of maintenance personnel will be described in Section 5.

Ranking and solution
The ranking and analysis of high-priority error roots, together with proposed solutions, are presented in Section 6.

Roots of human error of maintenance teams
The following procedure is used to identify and classify the causes of error based on the proposed framework in Fig. 4:
✓ Studying papers and research on human error, especially in the field of maintenance.
✓ Reviewing the steps of maintenance implementation.
✓ Investigating the history of electric power transmission industry events due to human error.
✓ Interviewing maintenance experts and technicians who have made mistakes.
✓ Interviewing skilled staff in the maintenance department.
✓ Interviewing expert supervisors of maintenance personnel.
✓ Interviewing safety experts.
✓ Collecting the findings and comments, and designing a questionnaire to describe human error in the power system through the HFACS framework.
✓ Selecting a sample based on the Cochran formula with a 5% error (approximately 132 people) and reviewing the questionnaire comments.
✓ Finalizing the basic causes of human error at the four levels of error in the HFACS method.
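The sample-size step above can be illustrated with a minimal sketch of Cochran's formula with finite population correction. The text gives only the 5% error and the resulting sample of about 132 people; the population size of 200, the 95% confidence level (z = 1.96), and the proportion p = 0.5 are assumptions chosen here for illustration, and the function name is hypothetical.

```python
import math


def cochran_sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Cochran sample size with finite population correction.

    population: size of the target population (assumed, not given in the text)
    margin:     acceptable error (5% in the text)
    z:          standard normal quantile (1.96 for 95% confidence, assumed)
    p:          expected proportion (0.5 is the most conservative choice, assumed)
    """
    # Sample size for an infinite population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Correction for a finite population
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Under the assumed population of 200 people this gives 132
print(cochran_sample_size(200))
```

A population of roughly 200 reproduces the reported sample of about 132 under these assumptions; a different confidence level or proportion would imply a different population size.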

Unsafe acts level
A maintenance executive team is made up of two or three people, one of whom is the team leader with more experience, expertise, and skills. However, the analysis of the FEMC events in 2016 and 2017 makes it clear that most of the human errors were committed by the more experienced persons. The investigation shows that the team leaders, because they have repeated the same tasks over many years, neglect to perform them properly, including the preliminary study of the protection plan, step-by-step follow-up of the instructions and checklists, etc. In addition, misunderstanding or devaluing the instructions and the test sheets can also lead to human error. Twelve basic causes are identified at the intermediate error level (errors and violations) as follows:

Precondition for unsafe acts
This level of error includes environmental factors, operator conditions, and individual factors. Since transmission substations and lines are usually built in suburbs, the executive groups are centralized in nearby cities to ensure proper maintenance coverage of the network. On average, each group covers 6000 km², so that in the event of an incident in the network, the teams can check and correct it in the shortest possible time. However, this causes the maintenance groups to be continuously on missions away from their workplace. In addition, the large volume of work leaves staff with insufficient relaxation and insufficient time with their families, resulting in physical and mental tiredness.
The shortage of backup technicians is one of the basic causes of human error because it intensifies the activities of the executive groups. This is the most significant cause of human error in the human resource management sublevel of individual factors. Ten roots of human error have been identified at the precondition for unsafe acts level as follows:

Unsafe supervision error level
The maintenance of power transmission networks in Iran is carried out by non-governmental contractors that are supervised by the regional electric companies. Therefore, the supervisory groups of the employer and of the contractors' headquarters indirectly affect the performance of the executive groups. The roots of the errors of the employer's supervisory groups are defined at the three levels of inadequate supervision, supervisory violation, and failure to correct a known problem, and eight causes are identified for these intermediate errors as follows:
✓ Inadequate supervision US1: Supervisor not following up on the resolution of drawing defects.
Limited knowledge and experience of the supervisors can result in irrational and non-normative comments that can disrupt the operation of the technical contractors. For example, the replacement of old and inefficient protective relays may be planned for the time of maintenance to prevent a repeated outage of equipment (lines, transformers, etc.), but such changes sometimes take too long to complete because of the workload, incorrect estimation of the relay replacement time when it coincides with maintenance, insufficient knowledge of the executive team, and so on. All of these are indirect factors that affect the performance of the executive personnel.
Since the use of protective drawings or relay catalogs is necessary for accurate and speedy maintenance, these drawings must be modified after any change in the protection circuits or equipment. Supervisors are responsible for the update, but sometimes due to lack of effective follow-up, executive groups experience shortcomings or contradictions.
The maintenance scheduling subgroup and the program coordinator at the headquarters of the contractor company contribute to human error in planned operations through nine identified causes. Studies show that the roots of inappropriate scheduling can be expressed by the following factors:
US9: Maintenance personnel not having adequate time to rest and upgrade their knowledge.
US10: Maintenance operation being carried out at an inappropriate time (such as from midnight to 6 am or during holidays).
US11: Setting the maintenance schedule regardless of the environmental and power network conditions.
US12: Highly demanding maintenance operations accompanied by repetitive actions.
US13: Executive teams trying to fix mistakes related to the existing data, settings, spare parts, etc. at runtime.
US14: Maintenance operation lasting longer than the working hours.
US15: Employer's request for maintenance operation outside the rules or guidelines.
US16: Synchronization of corrective or defect-removal projects with maintenance operation.
US17: Unsuitable appointment of personnel for sensitive tasks.

Organizational influence error level
The behaviors and decisions of the managerial level directly affect the mental conditions and activities of the operating groups such that even the smallest incorrect decision can cause disturbance and distrust in the whole organization. This level of error with 21 identified causes is the main reason for human error in terms of staff surveys and event roots. This has been reported as follows: Human resources are the most important assets of any maintenance organization. Therefore, strong motivational strategies can retain specialized and experienced personnel in the organization and attract new expert staff.
Three resources, namely financial resources, testing equipment, and human resources, can help improve activities. When financial resources are sufficient, personnel may receive appropriate salaries and the testing equipment can also be updated along with the development of electrical equipment.
Clarity in job descriptions and instructions can prevent staff confusion. For example, should there be no step-by-step instruction in the differential transformer relay test, each executive group would have to perform the relay setting based on the related experience and knowledge, which could cause unstable operation of the relay and inappropriate performance during the operation of the transformer. Such human error does not result from the mistakes of the executive groups, but is rather due to weakness in the organization.
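The cause counts reported in this section can be tallied per HFACS level as a quick consistency check against the 60 causes mentioned elsewhere in the paper. This is a minimal sketch: the level names and counts come from the text, while grouping the eight supervisory causes and the nine scheduling causes (US9 to US17) under unsafe supervision is our reading, and the enum and function names are illustrative.

```python
from enum import Enum


class HfacsLevel(Enum):
    UNSAFE_ACTS = "unsafe acts"
    PRECONDITIONS = "precondition for unsafe acts"
    UNSAFE_SUPERVISION = "unsafe supervision"
    ORGANIZATIONAL_INFLUENCE = "organizational influence"


# Number of basic causes per level, as stated in the text
CAUSE_COUNTS = {
    HfacsLevel.UNSAFE_ACTS: 12,
    HfacsLevel.PRECONDITIONS: 10,
    # 8 employer supervision causes + 9 scheduling causes (assumed grouping)
    HfacsLevel.UNSAFE_SUPERVISION: 8 + 9,
    HfacsLevel.ORGANIZATIONAL_INFLUENCE: 21,
}


def share_by_level(counts):
    """Return the fraction of all basic causes that falls in each level."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


total_causes = sum(CAUSE_COUNTS.values())
for level, share in share_by_level(CAUSE_COUNTS).items():
    print(f"{level.value}: {share:.0%}")
print("total:", total_causes)
```

Under this grouping the four levels sum to 60, and the organizational influence level holds the largest share of the identified causes.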

Determine the probability of occurrence of the base causes of errors
Since there is no documented information from past events, especially on the basic causes of errors, calculating the probability of the occurrence of the basic causes is not possible. Hence, the probability of the occurrence of the events is calculated using questionnaire surveys among the experts. The probability of the occurrence of the error is classified into five categories: frequent, probable, occasional, very low, and unlikely. The results of the survey are obtained from experts in both qualitative and linguistic forms, which are converted into numerical scores of 5 to 1 for subsequent calculations.
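The conversion from linguistic categories to numerical scores can be sketched as follows. The five categories and the 5-to-1 scale come from the text; mapping them in the listed order (frequent = 5 down to unlikely = 1) and mapping an averaged score back to the nearest category are assumptions made for illustration, as are the function names.

```python
# Linguistic likelihood categories and their scores (order assumed from the text)
LIKELIHOOD_SCORES = {
    "frequent": 5,
    "probable": 4,
    "occasional": 3,
    "very low": 2,
    "unlikely": 1,
}


def to_score(label):
    """Convert an expert's linguistic answer into its numerical score."""
    return LIKELIHOOD_SCORES[label.strip().lower()]


def to_label(score):
    """Map a (possibly fractional) score to the nearest category.

    The nearest-score rule is an assumption; on an exact tie the higher
    category is returned because it comes first in the dictionary.
    """
    nearest = min(LIKELIHOOD_SCORES.items(), key=lambda item: abs(item[1] - score))
    return nearest[0]


print(to_score("Probable"))
print(to_label(3.2))
```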
Thirty maintenance experts in power system protection, who act as group heads, supervisors, technicians, or workers, are selected according to Table 1. Since the experts differ in education, work experience, age, and organizational level, relative weighting factors as shown in Table 1 are applied to each expert's point of view. The relative weighting factor $W_k$ is obtained as:
$$W_k = \frac{\sum_{i=1}^{4} S_{ki}}{\sum_{j=1}^{n_e} \sum_{i=1}^{4} S_{ji}}$$
where $W_k$ is the relative weight of expert $k$, $S_{ki}$ is the score of expert $k$ on the four criteria, $S_{ji}$ is the score of expert $j$ on the four criteria, and $n_e$ is the number of experts.
The experts' perspectives on the probability of occurrence of the causes of the errors are calculated and presented in the form of a consensus for each cause of error as:
$$M_n = \frac{\sum_{k=1}^{P} W_k A_{nk}}{P \, W_m}$$
where $M_n$ is the consensus expert opinion on the probability of error $n$, $A_{nk}$ is the opinion of expert $k$ on the probability of error $n$, $W_m$ is the average relative weight of experts, and $P$ is the number of people who have commented on the probability of error $n$.
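The weighting and consensus steps can be illustrated with a short sketch. It assumes that each expert's relative weight is their total score on the four criteria divided by the total over all experts, and that the consensus is the weighted sum of opinions divided by the number of respondents times their average weight (a weighted mean). Both forms are one reading of the definitions above rather than a confirmed formula, and the function names and sample numbers are hypothetical.

```python
def relative_weights(criteria_scores):
    """Relative weight of each expert.

    criteria_scores: one list per expert with their scores on the four
    criteria (education, work experience, age, organizational level).
    Assumed form: expert total divided by the total over all experts.
    """
    totals = [sum(scores) for scores in criteria_scores]
    grand_total = sum(totals)
    return [total / grand_total for total in totals]


def consensus(opinions, weights):
    """Weighted consensus score for one cause of error.

    opinions: dict mapping expert index -> score from 1 to 5, containing
              only the experts who commented on this cause
    weights:  list of relative weights, indexed by expert
    Assumed form: sum(W_k * A_nk) / (P * W_m), where W_m is taken here as
    the average weight of the P experts who commented.
    """
    p = len(opinions)
    w_mean = sum(weights[k] for k in opinions) / p
    weighted_sum = sum(weights[k] * a for k, a in opinions.items())
    return weighted_sum / (p * w_mean)


# Hypothetical example: three experts, the first with twice the criteria score
weights = relative_weights([[4, 4, 4, 4], [2, 2, 2, 2], [2, 2, 2, 2]])
print(weights)
print(consensus({0: 5, 1: 3, 2: 3}, weights))
```

With the average weight taken over the respondents, the consensus always lies between the lowest and highest individual opinion, which keeps it on the same 1-to-5 scale as the questionnaire answers.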

Results and discussion
There are approximately 7600 relays installed in the Fars electricity network to protect the power grid in the case of fault events and to prevent catastrophic outages. The RPMT of FEMC must inspect and repair the protection circuits, assess their health status, and calibrate the settings of these relays throughout the year. Any member of the team may carry out any of these maintenance operations with error, low accuracy, or negligence under the influence of various root causes. This negligence may result in the failure of the same or other equipment during or after a maintenance operation. Our survey shows that 70% of equipment failures caused by human error occur after the completion of maintenance operations because maintenance operators were not working precisely. For instance, if the operator in the protection subgroup applies a wrong setting or configuration to the relay, the relay may malfunction during operation. Another example is the unplanned outage of transformers. If the maintenance group does not seal the mechanical relays (Buchholz relay or thermometer) precisely during transformer maintenance, they will misoperate because of water penetration.
Studies were conducted on FEMC protection teams in 2017 in approximately twenty 3-h sessions. The results of the study and of the expert opinion survey for predicting the roots of RPMT errors are analyzed in terms of their occurrence probability and are as follows: 1) Sixty underlying causes of errors have been identified according to the protection experts' opinion, and these affect the performance of personnel during maintenance. Organizational factors and unsafe supervision factors, as external stimuli, and unsafe act factors, as internal stimuli, both impact the operation of the maintenance personnel. These causes are classified into 20 proposed subcategories of the HFACS method. The relationship between the introduced error levels and the consequence is shown in the conceptual model of Fig. 5. As shown, the level of organizational factors impacts the other levels, and the unsafe act factors level is affected by the other error levels. The results of this study show that, from the viewpoint of the RPMT, the intermediate error causes of organizational influences, especially financial resource management, have the greatest impact on human error according to Table 2, while the intermediate error causes of supervisory and exceptional violations have the lowest effect. Table 3 shows the results of the specialized surveys on the probability of causes of human error for the maintenance teams on power system protection. According to the collected data, no cause of error falls in the "frequent" or "unlikely" categories.

2) Seven percent of the identified basic causes affect the performance of the RPMT with 'probable' probability. Figure 6 shows the ranking of the causes, where red indicates basic causes with 'probable' probability, blue with 'occasional' probability, and green with 'very low' probability. The two basic causes of error with 'probable' probability affect the performance of all teams, as follows:
➢ Paying attention to personnel finance issues, such as salary reform based on rank and job position, can boost teams' morale.
➢ Not employing a sufficient number of personnel puts a heavy burden on the working groups, leaving little time for rest and recovery. They also feel a great deal of physical fatigue because of exposure to environmental and atmospheric factors. This could lead to errors during maintenance work.
To eliminate or reduce the identified errors, accurate planning and management are required, e.g.: personnel assessment and ranking; paying attention to the knowledge of experienced maintenance personnel; accelerating staff recruitment; prioritizing maintenance checklist items; no mobile phone use while working; and paying a reasonable salary.
The results from this study can provide a good reference for the following future studies: 1. Use of virtual environments to identify and analyze human factors instead of surveys. 2. Use of virtual environments to calculate the probability of underlying causes of human error. 3. Studying the logical relationship between the causes and consequences of human error. 4. Ways to quantify the human reliability of relay protection maintenance personnel and to improve it. 5. The relationship between human error and the profitability of maintenance contractors, and cost-benefit analysis of reducing human error.

Conclusion
The analysis of recent events in the Fars Electricity Maintenance Company caused by human error shows that the protection subgroup, accounting for 62% of the errors, caused most of the human error-induced events among the executive maintenance teams. In this study, 60 basic causes of human error are identified and predicted using the four levels of error in the HFACS method. These can be used to control and increase the human reliability of maintenance teams and in electrical industry research.
Analyses and surveys from the RPMT in the FEMC show that 7% of the 60 basic causes have a high probability (i.e., are probable) of resulting in events. According to the results, the salary system, the inadequacy of test equipment, and the shortage of personnel and their tiredness due to high workload are the main factors affecting the behavior and performance of the executive personnel. Also, decisions at the top of the organization directly affect the performance at the lower levels.