March, 2001
Managing Complex Technical Systems
Working on a Bridge of Uncertainty
by
Eli Berniker, Ph.D., Pacific Lutheran University, Tacoma, WA, USA
and
Frederick Wolf, Ph.D., University of Puget Sound, Tacoma, WA, USA
Keywords: normal accidents, complexity, uncertainty, hazardous chemical
releases, petroleum refineries, safety
“Anyone can make something work. It takes an engineer to make it barely work.” Drew Worth, 1988 (personal communication)
Similar projections might be ventured in a wide variety of industries operating complex, tightly coupled, high-consequence technical systems. Despite the efforts of industry and government to improve safety, major industrial accidents are as likely today as they were 10 years ago. Since 1982, there has been no reduction in the fatality rate or major accident rate in US or European industry (Sellers, 1993; National Safety Council, 1998).
All such systems are socio-technical systems, i.e., technical systems operated by organized groups of people. Therefore, any general theory of complex systems risk must be anchored in both engineering, and consequently the natural sciences, and organizational science. At present, the engineering, managerial, and organizational approaches to systems safety derive from incommensurable theoretical foundations, producing partial analyses. They cannot be readily integrated into a coherent theory of systems failure and, therefore, a usefully complete model for systems safety. We will propose a general theory of complex systems failure, anchored in the laws of physics, which defines the fundamental challenges of systems safety and suggests links to the engineering and organizational disciplines relevant to these challenges.
We start with Normal Accident Theory (NAT) (Perrow, 1984, 1999) as a validated model of the relationship between technical system characteristics and their reliability and safety. Normal Accident Theory will be restated in terms of the Second Law of Thermodynamics as the basis for a general theory of systems failure relating systems complexity and coupling to the necessary uncertainty associated with the stochastic nature of failure paths. The fundamental challenge in managing complex high-consequence systems is the need to operate safely under conditions of uncertainty and continuing failures.
We will review current theories of systems safety with particular emphasis on how they relate to uncertainty, failure, systems complexity, and coupling. Revisiting NAT, we will discuss the research evidence for its validity. Finally, we will propose a general model for interdisciplinary collaboration to improve the management of complex high-consequence systems.
Normal Accident Theory
Perrow defines normal accidents as the outcome of multiple interacting events that are incomprehensible to operators when they occur and that result in catastrophic disruption of systems. These are systems failures rather than component failures. However, component failures can be catalysts of systems accidents. Normal accidents are rare but are, according to Perrow, an inevitable result of systems designs.
Perrow defined interactive complexity and tight coupling as characteristics of engineered systems that lead to normal accidents. Complexity and coupling are also characteristics of organizations and, in the form of limited resource availability, act to tighten the effective coupling of engineered systems. As a social scientist, Perrow utilized a two-by-two matrix to classify industries according to these parameters. Interactive complexity can be understood as an indication of the number of possible states available to the system, which Perrow described in terms of close proximity, interconnected subsystems, common mode connections, multiple interacting controls, indirect information, and limited understanding. It is important, for theoretical reasons, to differentiate the engineered characteristics of systems from those related to their human operators. Both indirect information and limited understanding can relate to engineering design and also to operating organizations. Intuitively, although not rigorously, interactive complexity suggests the number of ways a system might fail.
Coupling is descriptively defined as the availability of buffers, resources, time, and information that can enable recovery from failures. Intuitively, coupling suggests the availability of possible recovery paths. Tight coupling implies limited recovery potential, while loose coupling implies many more recovery paths.
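With both dimensions defined, Perrow's two-by-two classification can be illustrated as a simple lookup. The sketch below is only an illustration of the scheme: the normalized scores, the 0.5 threshold, and the function name are hypothetical, not Perrow's own scaling.

    def perrow_quadrant(interactive_complexity, coupling_tightness, threshold=0.5):
        """Place a system in one quadrant of a Perrow-style complexity/coupling matrix.
        Inputs are assumed to be normalized to [0, 1]; the 0.5 cut is arbitrary."""
        complexity = "complex" if interactive_complexity > threshold else "linear"
        coupling = "tightly coupled" if coupling_tightness > threshold else "loosely coupled"
        return complexity + ", " + coupling

    # Illustrative placements with invented scores, echoing Perrow's qualitative examples:
    print(perrow_quadrant(0.9, 0.9))  # e.g., nuclear plants: complex, tightly coupled
    print(perrow_quadrant(0.8, 0.3))  # e.g., universities: complex, loosely coupled
    print(perrow_quadrant(0.2, 0.8))  # e.g., dams: linear, tightly coupled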
Perrow illustrates these dynamics with the complex sequence of events at Three Mile Island that all occurred within eighteen seconds. Note that by far the vast majority of failures are minor incidents for which recovery is possible and adverse consequences are rare. However rarely, normal accidents do occur, often with catastrophic consequences.
The research challenge was to convert this classification scheme into a useful metric that can be applied to engineered technical systems. NAT is, first and foremost, a theory about engineered systems, focusing "on the properties of systems themselves rather than on the errors that owners, designers, and operators make in operating them" (Perrow, 1999). Wolf (2001), in his research on petroleum refineries, operationalized complexity in terms of process parameters and their states. Coupling was operationalized in terms of resource availability, a constraining factor in all complex systems. Both allow NAT to be stated in thermodynamic terms.
The Second Law of Thermodynamics
The Second Law of Thermodynamics, a fundamental law of physics, states that all systems must move towards increasing entropy, or disorder. Schrödinger (1967) explained the apparent capacity of open systems, including living systems, to grow and maintain themselves by their ability to import "negative entropy." Such a process must convert larger amounts of energy into entropy than it needs in order to retain ordered energy and grow. If the living, or open, system and its environment are taken together as the system, the Second Law is not violated.
A petroleum refinery is a very high energy, complex technical system that must perpetually seek entropic equilibrium. The ultimate equilibrium entropic state of such a plant would be a smoking black hole. The resources invested in ongoing maintenance of the refinery are necessary to assure safe operation and to reverse the entropic effects of wear on the system.
Boltzmann formulated the equation for the entropy of a system (Schrödinger, 1967):
S = k ln D
Where:
S is a measure of the entropy (disorder) present in a system. The equilibrium state represents the end point of the system.
k is the Boltzmann constant, which for high-risk technologies would be expressed as the energy density of the system.
D is a "measure of the atomistic disorder" (Schrödinger, 1967).
For engineered systems, D is the total number of possible states accessible to the system.
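To make the scaling concrete, consider a purely illustrative system of n subsystems, each able to occupy m distinct states, so that D = m^n and S = k ln D = k n ln m. Entropy, like interactive complexity, then grows directly with the number of interacting elements. A minimal sketch, with k normalized to 1 and invented values of n and m:

    import math

    def entropy(num_subsystems, states_per_subsystem, k=1.0):
        """S = k ln D, with D = states_per_subsystem ** num_subsystems (illustrative)."""
        D = states_per_subsystem ** num_subsystems
        return k * math.log(D)

    # Doubling the number of interacting subsystems doubles S = k * n * ln(m):
    print(entropy(10, 4))   # ~13.9
    print(entropy(20, 4))   # ~27.7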
Wolf (2001) operationalized system interactive complexity as Ci, an index of possible states of the system that is a subset of D. Coupling was operationalized in terms of slack resources available to the system. Greater resources loosen coupling and enable both prevention and recovery. Scarcity of resources limits recovery possibilities. In thermodynamic terms, resources represent the sources of negative entropy required to maintain the system and to recover from wear.
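Wolf's actual index is not reproduced in this paper, so the sketch below only illustrates the general form such an operationalization could take. The function names and all numbers are hypothetical stand-ins for the measures he derived from refinery data.

    import math

    def complexity_index(process_units, states_per_unit):
        """Hypothetical Ci: log of the count of reachable process states (a subset of D)."""
        return process_units * math.log(states_per_unit)

    def coupling_index(slack_resources, required_resources):
        """Hypothetical coupling measure: scarce slack (values near or above 1) means tight coupling."""
        return required_resources / max(slack_resources, 1e-9)

    # Illustrative comparison of two notional refineries (numbers are invented):
    simple_loose = (complexity_index(30, 3), coupling_index(2.0, 1.0))
    complex_tight = (complexity_index(120, 5), coupling_index(1.1, 1.0))
    print(simple_loose, complex_tight)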
It should be noted that this use of Boltzmann is not the first time that his equation has been applied to organizational science. Schrödinger (1967), in What is Life?, inverted Boltzmann's equation, coining the result "negative entropy" for order, which, in turn, Bertalanffy (1968) translated into information and organization as a foundation for general systems theory. The systems model so widely used in the organizational sciences owes much to the Second Law equation.
A General Theory of Systems Failure
The Second Law of Thermodynamics provides a firm basis for integrating Normal Accident Theory into a general theory of failure. The Second Law forces us to accept the inevitable risk of failure of technical systems. The Boltzmann equation makes clear that catastrophic failure is a consequence of entropy, or disorder, associated with the equilibrium end state of systems. Thus, failure must be associated with uncertainty and the impossibility of designing, building, or operating perfect technical systems. Leveson (1995) has argued that similar limitations apply to software, and Weinberg (1975) has demonstrated that such limitations are inherent in our scientific models. Weick (1990) recognized this character of technology:
The very complexity and incomprehensibility of new technologies may warrant a reexamination of our knowledge of the cause and effect relations in human actions (p. 150) [and] the unique twist of the new technologies is that the uncertainties are permanent rather than transient (p. 152).
Based upon the Second Law, we may conclude:
Any system of sufficient complexity and tightness of coupling must, over time, exhibit entropic behavior and uncertainty resulting in its failure.
Failure is a certainty and, therefore, recovery and repair must be possible for the system to avoid end-state equilibrium. In the case of complex, tightly coupled systems, we simply cannot know enough about them to anticipate all modes of failure and prevent them within the time and resources available.
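Some simple, illustrative arithmetic shows why anticipating all failure modes is out of reach: even a modest count of failure-prone elements yields an enormous number of possible multi-failure interactions. The component count below is invented.

    from math import comb

    components = 1000                       # hypothetical count of failure-prone elements
    pairs = comb(components, 2)             # distinct two-way failure interactions
    triples = comb(components, 3)           # distinct three-way failure interactions
    print(f"{pairs:,} possible pairwise interactions")     # 499,500
    print(f"{triples:,} possible three-way interactions")  # 166,167,000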
The challenge is:
How can we design, build, and operate complex, high-consequence technical systems under conditions of necessary uncertainties and many possible paths to failure?
Theories of Failure
Two major disciplines, organizational science and engineering, provide alternative models of failure. We will test their underlying paradigms with respect to their implicit sources of failure and, therefore, conditions for systems safety. We note that each of these disciplines makes valuable contributions to systems safety. Yet, given necessary uncertainties, all must be deficient.
The Assumption of Human Error
Shared across engineering and organizational science is a focus on human error as the source of failure and a strong tendency to confound failure with error. Human error is a pervasive theme in the psychological and organizational literature. The assumption is that if human error can be controlled and prevented, system safety will be assured. Implicit in this view is the denial of systems uncertainty.
Turner and Pidgeon's (1997) research on 13 large industrial accidents demonstrates the focus on human error. They note that during the time immediately preceding an accident: 1) events were unnoticed or misunderstood because erroneous assumptions were made; 2) discrepant events were unnoticed or misunderstood as a result of problems in information handling in complex situations; 3) events that warned of danger passed unnoticed as a result of human reluctance to fear the worst; and/or 4) formal precautions were not up to date and violations of rules and procedures were accepted as the norm. Note that each of these failure categories relates to human error or cognitive processes.
According to Weick and Roberts (1993), "We suspect that normal accidents represent a breakdown of social process and comprehension rather than a failure of technology. Inadequate comprehension can be traced to flawed mind rather than flawed equipment" (p. 378). Roberts (1989) has written, "It is not really clear that all high risk technologies will fail" (p. 287). By contrast, Leveson (1995) notes that systems can fail when they operate exactly as planned.
Weick, Sutcliffe and Obstfeld (1999) argue that:
Theoretically, a system with a well-developed capability for improvisation should be able to see threatening details in even the most complex environment, because, whatever they discover, will be something they can do something about.
The evidence presented in validating NAT will show that, with respect to complex systems, this is a cognitively impossible task.
It is clear that we make errors and that many failures may be caused by our errors. Leveson (1995) has critiqued engineering attributions of accidents to human errors, arguing that "the data may be biased and incomplete" and that "positive actions are usually not recorded." Attributions are often based on the premise that "operators can overcome every emergency… It appears…that the operator who does not prevent accidents caused by design deficiencies or lack of proper design controls is more likely to be blamed than the designer." Moreover, "separating operator error from design error is difficult and perhaps impossible."
She cites Rasmussen as arguing that "human error is not a useful term" and expresses particularly well our core disagreement with this school of thought:
The term human error implies that something can be done to humans to improve the state of affairs; however, the "erroneous" behavior is inextricably connected to the same behavior required for successful completion of the task (Leveson, 1995, p. 102).
Industrial Safety
The assumptions of human error are linked with a perception, shared across the organizational and engineering disciplines, that incident rates are linked to systems-level risk management.
In 1959, Heinrich published his classic work, Industrial Accident Prevention. He described a 'foundation for industrial safety' program, which has come to be known as 'Heinrich's Triangle' by safety practitioners in industry. Based on observations derived from a sample of 5,000 industrial accidents, Heinrich (1959) determined that "in a unit group of 330 accidents of the same kind and involving the same person, 300 result in no injury, 29 in minor injuries and 1 in a major loss" (p. 26). This observation led to the realization that a 'preventative opportunity' exists to manage safety. If the frequency of accidents that result in no injury can be reduced, the corresponding frequency of more serious accidents can also be reduced.
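Heinrich's 300:29:1 ratios can be read as an empirical frequency distribution. The short sketch below, using only his published counts, shows the implied probabilities and the proportional 'preventative opportunity' reading that this paper goes on to question for systems accidents.

    # Heinrich's (1959) unit group of 330 accidents: 300 no-injury, 29 minor, 1 major.
    no_injury, minor, major = 300, 29, 1
    total = no_injury + minor + major

    print(f"P(no injury) = {no_injury / total:.3f}")   # ~0.909
    print(f"P(minor)     = {minor / total:.3f}")       # ~0.088
    print(f"P(major)     = {major / total:.3f}")       # ~0.003

    # The proportional reading: if the same ratios held after halving the no-injury
    # base, the expected frequency of major losses would also halve.
    scaled_major = (no_injury / 2) * (major / no_injury)
    print(f"Implied major losses after halving near-misses: {scaled_major:.2f}")  # 0.50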
Proponents of High Reliability Organizations share this focus, holding that tracking and correcting incidents and minor accidents is essential to assuring continued reliable operations. However, as discussed by Rasmussen (1990), "…the error is a link in the chain, in most cases not the origin of the cause of events" (p. 1186).
There is no doubt that the safety records of complex systems can be improved when we focus on incidents and minor accidents. We cannot assume that Heinrich's triangle, correlating rates of incidents and accidents, is evidence that reducing the frequency of non-injurious accidents results in a corresponding reduction in systems accidents. Wolf (2001) found that "safety performance…was unrelated to the complexity of the process system," while his results strongly demonstrate that complex refineries experience more frequent hazardous chemical accidents. Therefore, the linkage between efforts to improve individual safety and the prevention of systems accidents is problematic.
Organizational Risk Management and Resource Models
Systems failure incidents may be understood as signals of impending risk. Responses to these warnings may be controlled or mediated by resource availability.
Marcus and Nichols (1996, 1999) described an organizational 'band of safety.' According to this model, organizations 'drift' within an acceptable performance envelope. Warnings of impending danger are signaled by increased rates of minor incidents and accidents. The organization can take action and correct the deviation, recognizing that "Correction depends on the magnitude of the signal, the sensitivity of detection and the width of the detection recovery zone" (Marcus and Nichols, 1996, p. 3).
Since a 'drifting band' varies temporally, it calls for a decision model that deals with time variation. According to March (1994), "Solutions are answers to problems that may or may not have been recognized. They can be characterized by their arrival times and their access to choice opportunities as well as by the resources they provide to decision makers who are trying to make their choices" (p. 200). As Rasmussen (1994) cautions, "During periods of economic pressure the safety boundary is likely to be breached because the accident margin will be unknown from normal work practice and will not reveal itself locally" (p. 29).
Rose (1990) identified a linkage between safety performance and resource allocation. In her study of airlines, "…lower profitability correlated with higher accident and incident rates" (p. 944). She attributed this to risky organizational behavior, including the use of older equipment; deferred upgrades and equipment modernization; the hiring of less experienced, lower-salaried employees; and the use of low-bid outside contractors for aircraft maintenance.
Integrating these models suggests that the bandwidth of the safety boundary is determined (at least to some extent) by resource availability and economic performance. The bandwidth of the Marcus and Nichols model is established by the frequency of signals, including incidents, minor accidents, mishaps, etc., and the level of risk acceptable to the organization at a particular time.
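A toy simulation can make this integrated picture concrete: risk drifts, minor incidents become more frequent as it rises, and a correction fires only when the incident signal clears a detection threshold. This is only a sketch of the idea, not Marcus and Nichols's model; every parameter name and value is hypothetical.

    import random

    def simulate(detection_threshold, window=20, steps=500, seed=1):
        """Toy 'band of safety': drift, incident signals, and threshold-triggered correction."""
        random.seed(seed)
        risk, incidents, corrections = 0.1, [], 0
        for _ in range(steps):
            risk = min(1.0, max(0.0, risk + random.uniform(-0.02, 0.03)))  # slow upward drift
            incidents.append(1 if random.random() < risk else 0)           # minor incident?
            if sum(incidents[-window:]) > detection_threshold:             # signal exceeds threshold
                risk = 0.1                                                 # corrective action
                corrections += 1
        return sum(incidents), corrections

    # A less sensitive detector (higher threshold, e.g. under resource pressure) lets risk
    # drift longer before correction, so more incidents accumulate overall.
    print(simulate(detection_threshold=4))
    print(simulate(detection_threshold=12))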
Shrivastava's (1992) conclusions concerning the Bhopal accident are consistent with this view: "If a plant is strategically unimportant, it receives fewer resources and less management attention, which in turn, usually makes it less safe" (Shrivastava, 1992, p. 43). Stein and Kanter (1993) also recognized this phenomenon: "The culprit becomes a system under such pressure to perform that mistakes are encouraged, constructive actions undercut and information withheld" (p. 59). Davidson (1970) observed:
The supervisor has to make a decision as to whether to halt production when a question is raised on the safety of continued operation…he asks himself if it is unsafe to the point that he must shut down the operations. Nine times out of ten he is incapable of making the judgement. He is trained to keep production going…Always on the back of his mind is the knowledge he will be held responsible for the loss of production that will occur if he takes the cautious course of slowing or stopping operations. (p. 108)
The key assumptions underlying the organizational and managerial approaches to complex systems management are that we can organize for systems safety with the right structure, culture, and sufficient resources. The technical system is accepted as a given, and it is assumed that all of its attendant equivocality and uncertainties can be overcome by organizational means. Uncertainty is not seen as a fundamental characteristic of complex systems but as a problem that can be overcome with good organization.
Engineering Reliability
Reliability and safety are concepts based upon opposing logics in engineering. Reliability is a characteristic of items or parts that is expressed by the probability that they will perform their required functions in a specified manner over a given time period and under specified or assumed conditions. Reliability uses a bottom-up approach, assuming that if the elements of a system are reliable, the whole system will be safe (Leveson, 1995).
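A minimal sketch of this bottom-up logic, with invented component reliabilities: for components that must all function (a series arrangement with independent failures assumed), system reliability is simply the product of component reliabilities, which is why component-level improvement is treated as the route to system-level performance. The point of the following paragraphs is that a high number here says nothing, by itself, about safety.

    from math import prod

    # Hypothetical component reliabilities over a fixed mission time (invented numbers).
    component_reliability = [0.999, 0.995, 0.998, 0.990]

    # Bottom-up, series logic with independent failures: every component must work.
    system_reliability = prod(component_reliability)
    print(f"System reliability: {system_reliability:.4f}")  # ~0.9821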
The assumption that systems reliability is a function of individual component reliability leads to several design principles and a paradox (Sagan, 1996). Reliability engineers concerned primarily with failure rate reduction may utilize parallel redundancy of parts, standby sparing of units, safety factors and margins, reduced component stress levels, and timed replacements (Leveson, 1995). "While these techniques are often effective in increasing reliability, they do not necessarily increase safety" (Leveson, 1995). Safety is a systems-level phenomenon which must be evaluated in terms of combinations of events involving both incorrect and correct component behavior and the environmental conditions under which these occur. Safety is a function of the system functioning in relation to its operating environment. This view is more consistent with the Second Law.
The paradox of engineering reliability strategies is "the paradox of redundancy" identified by Sagan (1996). As the number of components increases and the potential for catastrophic common mode failures exists, the probability of a systems accident can rise even as component reliability improves.
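The arithmetic behind the paradox can be sketched with invented numbers: parallel redundancy drives the independent failure probability down geometrically, but any shared (common-mode) failure probability puts a floor under the result, while each added channel still adds interactions for designers and operators to comprehend.

    def loss_of_function_prob(p_independent, n_channels, p_common_mode):
        """Illustrative model: the function is lost if all independent channels fail
        or a single common-mode event defeats them together."""
        all_independent_fail = p_independent ** n_channels
        return 1 - (1 - all_independent_fail) * (1 - p_common_mode)

    # Invented values: each channel fails with p = 0.01; a common-mode event with p = 0.001.
    for n in (1, 2, 3, 4):
        print(n, round(loss_of_function_prob(0.01, n, 0.001), 6))
    # Beyond two channels the result is dominated by the common-mode term (~0.001),
    # while the added channels still increase interactive complexity.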
In addition, the engineering reliability approach does not incorporate the operating organization and its capabilities. Hirschhorn (1982) notes that:
Engineers have not learned to design a system that effectively integrates worker intelligence with mechanical processes. They seldom understand that workers even in automated settings must nevertheless make decisions; rather they tend to regard workers as extensions of machines (quoted in Leveson, 1995, p. 107).
As a result, engineers often increase the incomprehensibility of complex systems. They design the monitoring, instrumentation, and feedback systems that inform operators about system states. Appropriate design of monitoring systems should be based upon an understanding of cognitive processes and individual and group decision making, domains that are generally excluded from an engineering education.
Engineering Design
Engineering design is a formal process of incremental steps that are integrated to yield a technical system. A design project begins with a feasibility study. The feasibility study is a formalized screening of alternatives that could satisfy the need which the engineering effort is intended to address (Dorf, 1996).
The preliminary design stage follows conceptual design. This design provides enough detail to allow a value engineering review to be performed to evaluate the probable life cycle cost of the project. During this stage, hazard analysis techniques are used to review the design. While useful, none is capable of identifying all potential failure modes and hazard scenarios associated with the nascent design. Risk balancing is part of the design process. Engineering is problem solving, "…applying factual information to the arts of design. What makes it intellectually exhilarating is resolving conflicts between performance requirements and reliability on one hand and the constraints of cost and legally mandated protection of human safety and environment on the other. Design becomes a tightrope exercise in tradeoffs" (Wenk, 1995, p. 22).
Safety factors or margins of error must be included in every design to compensate for uncertainties, which can arise from many factors. When a failure occurs, it yields information concerning the limitations of its design (Petroski, 1994). There is a Faustian bargain struck in the design process of all technical systems; the engineer must balance the need for reliability and safety against economic and physical constraints.
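One way to see why safety factors compensate for, but never eliminate, uncertainty is the standard stress-strength view; this is a textbook-style illustration with invented numbers, not an example from the paper. Even with a nominal factor of safety well above one, scatter in load and capacity leaves a residual probability that load exceeds capacity.

    from math import erf, sqrt

    def failure_probability(mean_capacity, sd_capacity, mean_load, sd_load):
        """P(load > capacity) when both are modeled as independent normal variables."""
        z = (mean_capacity - mean_load) / sqrt(sd_capacity**2 + sd_load**2)
        return 0.5 * (1 - erf(z / sqrt(2)))  # standard normal upper tail

    # Invented numbers: nominal factor of safety = 150 / 100 = 1.5, but scatter remains.
    print(failure_probability(mean_capacity=150, sd_capacity=15,
                              mean_load=100, sd_load=20))   # ~0.023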
Petroski (1992) warns not to expect an engineer to "…declare that a design is perfect and absolutely safe, such finality is incompatible with the whole process, practice and achievement of engineering" (p. 145).
Petroski (1992) describes the nature of engineering very clearly:
"Engineering design shares certain characteristics with the positing of scientific theories, but instead of hypothesizing about the behavior of a given universe, whether of atoms, honeybees, or plants, engineers hypothesize about assemblages of concrete and steel that they arrange into a world of their own making" (p. 43).
Every design is therefore subject to uncertainty; its rejection is only determined at the time of its failure. Engineering design certainly recognizes necessary uncertainty and the limitations of reliability and systems safety. However, there is no conceptualization of the experimental settings within which designs will be operated and tested. The operating organization is missing from the engineering worldview.
The space shuttle Challenger was the subject of several important works that deal with organizational culture. In 1988, Starbuck and Milliken wrote, "The most important lesson to learn from the Challenger disaster is not that some managers made the wrong decisions or how o-rings worked: the most important lesson is that fine-tuning makes failures very likely" (p. 335). Fine-tuning is "experimentation in the face of uncertainty" in the "context of very complex sociotechnical systems so its outcomes appear partially random."
Validating Normal Accident Theory
Our general theory of systems failure is derived from the extension of Normal Accident Theory to engineering principles. The validity of Normal Accident Theory (NAT) requires demonstration.
NAT is defined and exemplified by Charles Perrow (1984, 1999) as a model that anticipates the failure of complex, tightly coupled technical systems. However, as an organizational study, neither complexity nor coupling was defined in physical terms. As discussed earlier, Wolf (2001; Wolf & Berniker, 1999) operationalized interactive complexity in terms of the Second Law of Thermodynamics and Boltzmann's equation for entropy. He then validated NAT on a population of petroleum refineries.
Interactive complexity and tightness of coupling are the core of Normal Accident Theory (Perrow, 1984). Perrow defines complexity by illustration rather than by rigorous definition. He illustrates his concept of interactive complexity