(published in: Kleinman, Cloud-Hansen, Matta, and Handelsman (editors), Controversies in Science and Technology Volume 2, Mary Ann Liebert Press, 2008.)
Technical and Managerial Factors in the NASA Challenger and
Columbia Losses: Looking Forward to the Future
Nancy G. Leveson, MIT
The well-known George Santayana quote, “Those who cannot learn from history are doomed to repeat it”1 seems particularly apropos when considering NASA and the manned space program. The Rogers Commission study of the Space Shuttle Challenger accident concluded that the root cause of the accident was an accumulation of organizational problems.2 The commission was critical of management complacency, bureaucratic interactions, disregard for safety, and flaws in the decision-making process. It cited various communication and management errors that affected the critical launch decision on January 28, 1986, including a lack of problem-reporting requirements; inadequate trend analysis; misrepresentation of criticality; lack of adequate resources devoted to safety; lack of safety personnel involvement in important discussions and decisions; and inadequate authority, responsibility, and independence of the safety organization.
Despite a sincere effort to fix these problems after the Challenger loss, seventeen years later almost identical management and organizational factors were cited in the Columbia Accident Investigation Board (CAIB) report. These are not two isolated cases. In most of the major accidents in the past 25 years (in all industries, not just aerospace), technical information on how to prevent the accident was known and often even implemented. But in each case, the potential engineering and technical solutions were negated by organizational or managerial flaws.
Large-scale engineered systems are more than just a collection of technological artifacts.3 They are a reflection of the structure, management, procedures, and culture of the engineering organization that created them. They are also, usually, a reflection of the society in which they were created. The causes of accidents are frequently, if not always, rooted in the organization—its culture, management, and structure. Blame for accidents is often placed on equipment failure or operator error without recognizing the social, organizational, and managerial factors that made such errors and defects inevitable. To truly understand why an accident occurred, it is necessary to examine these factors. In doing so, common causal factors may be seen that were not visible by looking only at the direct, proximal causes. In the case of the Challenger loss, the proximal cause4 was the failure of an O-ring to control the release of propellant gas (the O-ring was designed to seal a tiny gap in the field joints of the solid rocket motor that is created by pressure at ignition). In the case of Columbia, the proximal cause was very different—insulation foam coming off the external fuel tank and hitting and damaging the heat-resistant surface of the orbiter. These proximal causes, however, resulted from the same engineering, organizational, and cultural deficiencies, and it is these deficiencies that must be fixed before the potential for future accidents can be reduced.
This essay examines the technical and organizational factors leading to the Challenger and Columbia accidents and what we can learn from them. While accidents are often described in terms of a chain of directly related events leading to the loss, examining this chain does not explain why the events themselves occurred. In fact, accidents are better conceived as complex processes involving indirect and non-linear interactions among people, societal and organizational structures, engineering activities, and physical system components.5 They are rarely the result of a chance occurrence of random events, but usually result from the migration of a system (organization) toward a state of high risk where almost any deviation will result in a loss. Understanding enough about the Challenger and Columbia accidents to prevent future ones therefore requires determining not only what was wrong at the time of the losses, but also why the high standards of the Apollo program deteriorated over time, allowing the conditions the Rogers Commission cited as the root causes of the Challenger loss, and why the fixes instituted after Challenger lost their effectiveness over time; in other words, why the manned space program tends to migrate toward states of such high risk and such poor decision-making processes that an accident becomes almost inevitable.
One way of describing and analyzing these dynamics is to use a modeling technique, developed by Jay Forrester in the 1950s, called system dynamics. System dynamics is designed to help decision makers learn about the structure and dynamics of complex systems, to design high-leverage policies for sustained improvement, and to catalyze successful implementation and change. Drawing on engineering control theory, system dynamics involves developing formal models and simulators to capture complex dynamics and to create an environment for organizational learning and policy design.6
Figure 1. A Simplified System Dynamics Model of the NASA Manned Space Program
Figure 1 shows a simplified system dynamics model of the NASA manned space program. Although a simplified model is used for illustration in this paper, we have a much more complex model, with several hundred variables, that we are using to analyze the dynamics of the NASA manned space program.7 The loops in Figure 1 are feedback control loops; a “+” or “–” on a loop indicates the relationship (positive or negative) between state variables: a “+” means the variables change in the same direction, while a “–” means they move in opposite directions. There are three main variables in the model: safety, complacency, and success in meeting launch rate expectations. The model will be explained in the rest of the paper, which examines four general factors that played an important role in the accidents: the political and social environment in which decisions were made, the NASA safety culture, the NASA organizational structure, and the safety engineering practices in the manned space program.
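The flavor of such a feedback model can be conveyed with a toy simulation. The three state variables (safety, complacency, and success in meeting launch rate expectations) are taken from the model described above, but the equations, gains, and initial values below are purely illustrative assumptions made for this sketch; they are not the actual model used in the NASA analysis.

```python
def simulate(steps=400, dt=0.1, k=(0.5, 0.5, 0.5)):
    """Toy system dynamics sketch: three coupled feedback loops,
    integrated with simple Euler steps. All equations are illustrative."""
    k1, k2, k3 = k
    # State variables, each normalized to the range [0, 1].
    safety = 0.9        # effectiveness of safety efforts (starts high)
    complacency = 0.1   # organizational complacency (starts low)
    success = 0.5       # success in meeting launch rate expectations

    history = [(safety, complacency, success)]
    for _ in range(steps):
        # "+" link: sustained success breeds complacency.
        d_complacency = k1 * (success - complacency)
        # "-" link: complacency erodes safety (safety drifts toward 1 - complacency).
        d_safety = k2 * ((1.0 - complacency) - safety)
        # Closing the loop: success eventually tracks the actual safety level.
        d_success = k3 * (safety - success)

        complacency += dt * d_complacency
        safety += dt * d_safety
        success += dt * d_success
        history.append((safety, complacency, success))
    return history
```

Run with these (assumed) gains, the simulated organization drifts from its high-safety initial state toward a markedly riskier equilibrium: success feeds complacency, complacency erodes safety, and degraded safety eventually undermines success. This is a crude illustration of the migration toward high risk that the essay describes, not a prediction of any real program's behavior.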