August 17, 2018
Electrical distribution systems in the U.S. served approximately 152 million customers in 2017. A large portion of the electricity delivered by the utility companies is generated in power plants, brought to main distribution hubs, and then distributed to the utility customers from central hubs known as distribution substations. Therefore, if a distribution substation is lost due to a major failure, service to a large number of customers can be disrupted.
Major failures such as these are rare, but when they occur, post-failure investigations are of paramount importance to reduce the likelihood of recurrence. The investigations should cover not only the technical aspects of the failures but also the organizational factors that contributed to them. Suggested corrective actions should also be outlined and discussed.
This article gives an overview of the components of such investigations, with reference to a substation outage that occurred in a large U.S. metropolitan area. The outage resulted in the loss of service for tens of thousands of customers for multiple hours.
The Exponent investigative team was dispatched to the incident scene the same day and started the investigation immediately. Exponent performed a thorough investigation related to the root cause and other contributing factors including the emergency response and customer restoration time after the incident. The Exponent report in this case outlined the findings, discussed contributing factors, and recommended corrective actions, all of which were implemented.
Electrical substations incorporate redundancies and have automated protection systems to increase safety and minimize customer outage in case of a failure. Several protection layers, known as protection zones, are monitored by automatic protective devices that can isolate sections of a substation and its feeders within a fraction of a second if a failure occurs.
These systems help minimize the number of customers who lose service when one section fails. Some substation designs automatically transfer load to the sections that remain in operation when other sections fail. Sophisticated protection and load transfer schemes are also programmed into the protection system to ensure reliable transfer of loads and to minimize the total duration of outages.
For the reasons mentioned above, electrical substations are highly reliable, and outage events are relatively rare. However, equipment failures within substations do occur and can result in the loss of power to the substation. These events usually involve multiple failures leading to the substation outage. More often than not, though, there are opportunities to detect the initial failures and prevent them from cascading into a major event that leads to the loss of the substation.
Study of past failures provides valuable lessons to prevent future similar events. However, two major substation failures are rarely identical. The failures are also usually unique to the substations and their components. This makes the study of the organizational and operational barriers against total substation loss important. Therefore, after analyzing substation failure incidents, it is important to identify gaps in the organizational and operational procedures that could lead to such failures.
The recommended causal assessment methodology is a structured approach for causal analysis and consists of the following five steps:
- Data Collection
- Reconstruction of Problem Scenarios
- Performance of Causal Analysis
  - Direct cause
  - Root cause
- Review of Restorations and Extent of Condition
- Development of Recommended Actions to Prevent Recurrence
These steps are described below.
Data collection is performed through site inspections, review of event-related documents, digitally recorded data, recorded voice conversations, public records, and interviews. This includes data recorded at the time of the incident by the digital protection and the Supervisory Control and Data Acquisition (SCADA) system or Distributed Control System (DCS).
Important information is typically available at early stages following an incident; therefore, it is important that the investigative team be at the site of the incident as soon as possible. The investigative team at the site typically interviews available personnel, collects data, and photographically documents the substation condition. Following the initial site visit, the investigative team needs continued access to the site for follow-up data collection and personnel interviews.
In the case of the above-mentioned substation investigation, Exponent performed numerous site inspections during the months following the incident. During those site visits, important data were collected on the configuration of different systems, including while the failed components were being removed and replaced. All removed components were tagged as evidence and preserved. Later inspection and testing of those components provided valuable clues about the cause of the incident.
Of particular importance during the substation failure investigation was the inspection of the hardware of the protection and SCADA systems. This was to ensure that they operated as intended. Subsequent interviews of the personnel who either witnessed the failure or were knowledgeable about the substation operational and historical problems helped in determining the direct cause of the failure.
Collected information was not limited to the events leading to the incident; it also included the reactions after the incident and the restoration process to evaluate the effectiveness of recovery. Publicly available information about the incident was also included in the data collection process as part of the investigation. This included media reports, police and fire department reports, personal cell phone photos, and information posted on social media.
Reconstruction of problem scenarios
As a result of the data collection activities, Exponent developed a timeline during the substation investigation to identify the relative timing of events for evaluation. The detailed timeline included events from the emergency actions after the incident to when the restoration was complete. The timeline was constructed using all available data, including SCADA logs, audio recordings, phone records, cell phone photographs, security camera footage, badge reader logs, the station logbook, fire alarm panel logs, the fire department's computer-aided dispatch, and the fire department's radio recordings.
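The merging of records from several sources into one timeline can be illustrated with a minimal sketch. It is not Exponent's tooling; the `Event` type, the `build_timeline` function, the per-source clock offsets, and all sample records are hypothetical and chosen only for illustration. The sketch assumes that each source may run on its own clock and that a correction has already been estimated for it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime  # time after the per-source clock correction
    source: str          # e.g. SCADA log, badge reader, dispatch record
    description: str


def build_timeline(records, clock_offsets=None):
    """Merge (source, timestamp, description) records into one ordered list.

    clock_offsets maps a source name to a timedelta that is added to its
    timestamps. Sources without an entry are assumed to need no correction.
    """
    offsets = clock_offsets or {}
    events = [
        Event(ts + offsets.get(src, timedelta(0)), src, desc)
        for src, ts, desc in records
    ]
    return sorted(events, key=lambda e: e.timestamp)


# Hypothetical records, not data from the investigation described here.
records = [
    ("fire_dispatch", datetime(2000, 1, 1, 10, 5, 0), "Units dispatched"),
    ("scada", datetime(2000, 1, 1, 10, 0, 2), "Breaker trip alarm"),
    ("badge_reader", datetime(2000, 1, 1, 10, 1, 30), "Crew enters control house"),
]
# Assumed correction: the badge reader clock runs 45 seconds fast.
offsets = {"badge_reader": timedelta(seconds=-45)}

timeline = build_timeline(records, offsets)
for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.description)
```

Keeping the source name on every event makes it possible to trace each timeline entry back to the record it came from when sources disagree.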
Performance of causal analysis
There are different known methods of causal analysis for determining root causes. Causal analysis is not limited to the technical aspects of the failure, herein referred to as the direct cause. It also includes other factors such as processes, policies, human factors, and organizational factors.
A commonly used method is the Events and Causal Factors Analysis (ECFA) approach. Exponent used this method to identify potential systemic incident causes (e.g., management policies and organization) for each initiating event during the substation failure investigation. It involved repeatedly asking why the event or precondition existed and providing evidence to support each "why" in order to identify the underlying causes.
The outcome of the above causal analyses was the identification of the causes including the technical cause. This information formed the basis for developing recommended corrective actions.
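The repeated "why" questioning described above can be pictured as a small tree in which every cause carries its supporting evidence. The following is a minimal sketch under that assumption, not a formal ECFA chart; the `Cause` type, the helper functions, and the example chain are hypothetical and do not represent the findings of the investigation discussed in this article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    statement: str
    evidence: List[str] = field(default_factory=list)  # what supports this cause
    why: List["Cause"] = field(default_factory=list)   # answers to "why did this exist?"


def underlying_causes(node):
    """Return the causes at the end of each why-chain (nodes with no further why)."""
    if not node.why:
        return [node.statement]
    found = []
    for child in node.why:
        found.extend(underlying_causes(child))
    return found


def unsupported(node):
    """Return every cause in the tree that has no evidence recorded against it."""
    missing = [] if node.evidence else [node.statement]
    for child in node.why:
        missing.extend(unsupported(child))
    return missing


# Hypothetical chain for illustration only.
root = Cause(
    "Feeder breaker failed to trip",
    evidence=["relay event record"],
    why=[
        Cause(
            "Trip circuit wiring degraded",
            evidence=["wiring inspection photos"],
            why=[Cause("No periodic inspection of control wiring")],
        )
    ],
)

print(underlying_causes(root))
print(unsupported(root))
```

In this example the deepest cause has no evidence attached yet, so it appears in both lists, which signals that more data collection is needed before it can be reported as an underlying cause.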
Causal analysis — direct cause
The direct cause of the failure is the technical reason for the failure, such as failing equipment or a malfunctioning protection device. The direct cause does not include other factors such as processes, policies, human factors, and organizational factors.
An engineering review of the collected data after the incident is important to determine the direct cause of the incident. This usually involves reviewing the digitally collected data as well as the system protection targets. The direct cause analysis also requires understanding the system design. Often the investigative team has to review technical documents such as the station one-line and direct current (DC) diagrams to fully understand the extent of the conditions leading to the incident and to interpret the collected data.
The direct cause should include not only the review of relevant electrical and protection designs but also the actual installations and their consistency with the designs and their intended function. This may include inspecting relevant control wires in the substation, since in some of the older installations, wiring and equipment age and wear could be a contributing factor. The investigative team can potentially perform tests on the DC control system to ensure correct functionality of the system, though such tests after the incident in an active substation may be difficult and may require coordination with the distribution control center.
The direct cause of the incident is identified when the underlying technical cause of the incident is determined. The underlying technical cause must explain all the available evidence including the digitally collected data.
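The requirement that the technical cause explain all available evidence can be expressed as a simple coverage check. The sketch below is only an illustration of that reasoning; the function names, the evidence items, and the candidate hypotheses are hypothetical and not taken from the investigation.

```python
def unexplained_evidence(evidence, explained_by):
    """For each candidate cause, list the evidence items it does not explain."""
    all_items = set(evidence)
    return {name: all_items - set(items) for name, items in explained_by.items()}


def consistent_candidates(evidence, explained_by):
    """Return the candidate causes that leave no evidence item unexplained."""
    gaps = unexplained_evidence(evidence, explained_by)
    return [name for name, gap in gaps.items() if not gap]


# Hypothetical evidence and candidates for illustration only.
evidence = {"relay targets", "SCADA alarm sequence", "damaged cable section"}
candidates = {
    "relay misoperation": {"relay targets", "SCADA alarm sequence"},
    "cable fault": {"relay targets", "SCADA alarm sequence", "damaged cable section"},
}

gaps = unexplained_evidence(evidence, candidates)
print(consistent_candidates(evidence, candidates))
```

A candidate that leaves any item unexplained is not necessarily wrong, but under the criterion above it cannot be accepted as the direct cause until the gap is resolved.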
Causal analysis — root cause
The root cause analysis was performed using a methodology similar to the direct cause analysis; however, the process was not limited to the technical aspects leading to the incident. The potential systemic causes and contributing factors included processes, policies, human factors, and organizational factors for each initiating event. Causal analysis charts were important tools for identifying the underlying causes of the incident in a structured way.
Restorations & extent of condition
The investigative team also analyzed the response after the incident, including the emergency actions and restorations. As the substation failure involved a fire, important lessons were learned to improve the effectiveness of the substation crew response and coordination with the fire department. This analysis revealed gaps in the utility's fire response and in personnel training on the procedures to follow in case of a substation fire. It also revealed gaps in the coordination and training with the first responders. Suggested corrective actions were applicable to the other substations as well.
Review for extent of condition & extent of cause
An outcome of the causal analysis was to identify the potential for a condition or cause to exist elsewhere, which could lead to a similar failure. During this review, it was important to consider what other events could occur as a result of similar organizational or management factors.
Development of recommended actions to prevent recurrence
The desired outcome of causal analysis is to identify recommended corrective actions to prevent recurrence of the problem and to identify lessons learned. Effective corrective actions are those that address the causes, are implementable by the organization, and are consistent with company business goals and strategies. All of the Exponent recommendations in the above-mentioned substation failure case were implemented.
Electrical substation failures are rare events, but when they occur, they need to be thoroughly investigated. Important information can be lost if the causal investigation is limited to the technical aspects of the failure or the "direct cause." Other contributing factors such as human factors, policies, and organizational factors should also be included in the causal analysis. Investigative teams should have access to the site to conduct the investigations as soon as possible after the incident. They should also have unrestricted access to relevant design documentation, as well as access to the site during the replacement of the failed units for inspections and personnel interviews. This article suggests a structured approach consisting of five steps. The process involves repeatedly asking why the initiating event existed and providing evidence for each "why" in order to identify the underlying causes.
How Exponent Can Help
Exponent consultants have comprehensive expertise in the planning, construction, operation, and maintenance of traditional utility systems and their components and newly emergent technologies reshaping the industry. We apply this knowledge to solve complex problems such as substation failures. Exponent has performed several substation failure investigations. Our team responded to the substation outage discussed here on the same day. Investigations continued for several months, and the final report included recommended corrective actions after identifying the underlying causes.