Fig. 5.3 Cause-consequence diagram (figure elements: initiating event, fault trees, time delay, Yes/No condition branches, consequences)
man et al. 1994), and programmable user modelling applications (Blandford et al. 1999) have emerged to reconcile deficiencies in the tree-based analysis techniques. Furthermore, although these techniques are well suited to designing for safety in process engineering designs, their use in designing for systems control is complicated by the large number of ways in which computational control can address, or even contribute to, hazardous system states. This problem is solved by the use of a relatively new forward analysis technique called deviation analysis (Leveson 1995).
Deviation analysis (DA) is based on the underlying assumption that many accidents or incidents are the result of deviations in system variables, where a deviation is the difference between the actual and correct values appropriate for system control. The method originates from the forward analysis technique of software deviation analysis (SDA), in which hazardous behaviour in system control software is analysed. DA is an extension of the technique to system control hardware. Deviation analysis determines whether hazardous systems behaviour can result from a class of input deviations included in the broad range of process characteristics such as capacity, input, throughput, output and quality. It is a means of determining system component robustness (or, in safety terminology, its survivability), or how it will behave in an imperfect environment.
Hazard and operability studies (HAZOP) were first introduced by engineers from ICI Chemicals in the UK in the 1970s. The method entails the investigation of deviations from the design intent for a process engineering installation by a design team with expertise in different areas such as engineering, operations, maintenance, safety and chemistry. The team is guided through a structured process, using a set of guidewords to examine deviations from normal process conditions at various key points (nodes) throughout the process. The guidewords are applied to the relevant process parameters (for example, flow, temperature, pressure and composition) in order to identify the causes and consequences of deviations. Typical terms used in a HAZOP are the following (Kletz 1999):
• Node: a specific location in the process at which (the deviations of) the process intention are evaluated.
• Intention: a description of how the process is expected to behave at the node; this is qualitatively described as an activity (e.g. feed, reaction, sedimentation) and/or quantitatively in the process parameters, like temperature, flow rate, pressure, composition, etc.
• Deviation: a way in which the process conditions may depart from their intention.
• Parameter: the relevant parameter for the condition(s) of the process, e.g. pressure, temperature, composition, etc.
• Guideword: a short word to describe a deviation of the intention. The most commonly used guidewords are NO, MORE, LESS, AS WELL AS, PART OF, OTHER THAN and REVERSE. In addition, guidewords like TOO EARLY, TOO LATE, INSTEAD OF, etc. are used, the latter mainly for batch-like processes. The guidewords are applied, in turn, to all parameters, in order to identify unexpected and yet credible deviations from the intention.
• Cause: the reason(s) why the deviation could occur. Many causes could be identified for one deviation.
• Consequence: the results of the deviation, in case it occurs. Consequences may comprise both process hazards and operability problems, like plant shutdown or quality decrease of the product. Many consequences can follow from one cause and, in turn, one consequence can have several causes.
• Safeguard: facilities that help to reduce the occurrence frequency of the deviation or to mitigate its consequences. There are five types of safeguards:
a) Facilities that identify the deviation. These comprise, among others, alarm instrumentation and human operator detection.
b) Facilities that compensate for the deviation, e.g. an automatic control system that reduces the feed to a vessel in case of overfilling (increase of level). These are usually an integrated part of the process control.
c) Facilities that prevent the deviation from occurring.
d) Facilities that prevent the deviation from escalating (e.g. trips). These facilities are often interlocked with several units in the process, and controlled by logical computers.
e) Facilities that relieve the process from the hazardous deviation. These comprise, for instance, pressure safety valves (PSV) and vent systems.
• Recommendation: activities identified during a HAZOP study for follow-up. These may comprise technical improvements in the design, modifications in the status of drawings and process descriptions, procedural measures to be developed or further in-depth studies to be carried out.
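The step of applying each guideword, in turn, to each parameter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `candidate_deviations`, the worksheet field names and the node name are hypothetical, and the guideword and parameter lists are simply those quoted above. The output is a list of candidate deviations; deciding which of them are credible remains the task of the HAZOP team.

```python
from itertools import product

# Guidewords and parameters as quoted in the text; everything else is illustrative.
GUIDEWORDS = ["NO", "MORE", "LESS", "AS WELL AS", "PART OF", "OTHER THAN", "REVERSE"]
PARAMETERS = ["flow", "temperature", "pressure", "composition"]


def candidate_deviations(node, parameters=PARAMETERS, guidewords=GUIDEWORDS):
    """Return one empty worksheet row per (parameter, guideword) pair for a node."""
    return [
        {
            "node": node,
            "parameter": parameter,
            "guideword": guideword,
            "deviation": f"{guideword} {parameter}",
            # The remaining fields are filled in by the team during the study.
            "causes": [],
            "consequences": [],
            "safeguards": [],
            "recommendations": [],
        }
        for parameter, guideword in product(parameters, guidewords)
    ]


rows = candidate_deviations("surge tank inlet")  # hypothetical node name
print(len(rows))             # 4 parameters x 7 guidewords
print(rows[0]["deviation"])  # first candidate deviation
```

Not every combination is meaningful (REVERSE temperature, for instance), which is why the text asks only for deviations that are unexpected and yet credible.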
5.2.1.1 Fault-Tree Analysis for Safety in Engineering Design
The concept of fault-tree analysis (FTA) was originated by Bell Telephone Laboratories in the 1960s as a technique to perform a safety evaluation of the Minuteman Intercontinental Ballistic Missile Launch Control System. A fault tree is a logical diagram that shows the relation between system failure, i.e. a specific undesirable event in the system, and failures of the components of the system. It is a technique based on deductive logic. An undesirable event is first defined, and causal relationships of the failures leading to that event are then identified. Fault trees can be used in qualitative or quantitative risk analysis. The difference between the two is that the qualitative fault tree is linguistic in structure and does not require use of the same rigorous logic as does the formal quantitative fault tree (cf. Fig. 5.4).
FTA is a deductive technique that focuses on a particular accident or failure, and provides a method for determining causes of that event. Fault-tree diagrams use logical operators, principally the OR and AND gates. The terminology is derived from electrical circuits, the term ‘gate’ referring to the control of a signal or electrical current. The term OR denotes a choice between two or more signals, any of which can ‘open’ the gate. The AND term refers to the requirement that all signals are necessary before there is an output from the gate. Figure 5.4 shows the logic and event symbols used in FTA.
Fig. 5.4 Logic and event symbols used in FTA:
• OR gate: the output event occurs if any of the input events occur.
• AND gate: the output event occurs only when all input events occur.
• Intermediate event: a fault that results from the interactions of other fault events.
• Basic event: a component failure that requires no further development.
• Undeveloped event: a fault that is not examined further because information is unavailable.
• Transfer IN/OUT symbols: IN indicates the tree is developed further at a corresponding OUT symbol.
Fault-tree analysis for safety in engineering design is conducted in several steps, from defining the problem to constructing the fault tree, analysing the fault tree, and documenting the results, specifically:
Step 1 Defining the Problem
The engineering design team selects:
• the top event,
• the boundary conditions,
• system physical bounds,
• the level of systems resolution,
• initial conditions,
• events that are not allowed,
• existing conditions,
• conditional assumptions.
Defining the top event is one of the most important aspects of the first step. The top event is the accident (or undesired event) that is the subject of the FTA. The top event is often identified through other hazard analysis studies (such as HAZID). Top events should be precisely defined for the system or plant being evaluated, because analysing broadly scoped or poorly defined top events can often lead to an inefficient analysis.
For example, a top event of ‘gas leaks in the plant’ is too general. Instead, an appropriate top event would be ‘gas leak in the HC piping of the acid separation plant precipitation tank B’.
The physical system boundaries encompass the system’s equipment, the equipment’s interfaces with other processes, and the utility/support systems that are to be included in the FTA. The design team should also specify the level of systems resolution for the fault-tree events. For example, a motor-operated valve can be included as a single item of equipment (i.e. component) or it can be described as several hardware items (i.e. parts, e.g. the valve body, valve internals, and motor operator). The systems resolution of the FT should be limited to the detail needed to satisfy the analysis objective, and should parallel the resolution of the available information.
The initial equipment configuration or initial operating conditions describe the system in its normal, unfailed state. Events that are not allowed are, for the purposes of the FTA, events that are considered to be unlikely or that are not to be considered in the analysis for some specific reason. For example, wiring failures may be excluded from the analysis of an instrument system, or cabling may be excluded from the analysis of power generating units. Existing conditions within which the system functions are estimates (and assumptions) of the possible operational conditions that may arise within the system and its equipment, either as a result of the system’s inherent complexity, or as a result of the complex integration of various systems.
Step 2 Constructing the Fault Tree
The FTA begins at the top event and proceeds, level by level, until all fault events have been traced to their basic contributing causes (i.e. basic events). At each level, the immediate, necessary and sufficient causes are defined that would result in the intermediate or top event under consideration. The analysis continues at each level, until basic causes or the analysis boundary conditions are reached.
Fig. 5.5 Safety control of cooling water system
Returning to the simple fault tree of a cooling water system depicted in Fig. 3.19 of Sect. 3.2.2.6, which deals with fault-tree analysis in reliability assessment, assume that the systems design included provision for a back-up surge tank with an appropriate control alarm in the event that the tank overflowed, indicating problems with the cooling water feed. These problems would typically be:
• Excess inflow
• Low surge outflow
• Control alarm failure
• Operator error
Figure 5.5 shows an example of the cooling water surge tank fault tree with two levels below the top event.
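A level-by-level construction of this kind maps naturally onto a small tree data structure. The following Python sketch is illustrative only: the class names `Basic` and `Gate` and the function `occurs` are hypothetical, and the gate arrangement (an AND gate at the top event over two OR gates, named here ‘No surge control’ and ‘No alarm control’) is an assumption inferred from the events named in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Basic:
    """A basic event: a failure that is not developed further."""
    name: str


@dataclass
class Gate:
    """An intermediate or top event fed by an AND or OR gate."""
    name: str
    kind: str  # "AND" or "OR"
    inputs: list = field(default_factory=list)


def occurs(node, state):
    """Return True if the event at `node` occurs for the given basic-event states."""
    if isinstance(node, Basic):
        return state.get(node.name, False)
    results = [occurs(child, state) for child in node.inputs]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate type: {node.kind}")


# Assumed layout: top event, two intermediate events, four basic events.
no_surge_control = Gate("No surge control", "OR",
                        [Basic("Excess inflow"), Basic("Low surge outflow")])
no_alarm_control = Gate("No alarm control", "OR",
                        [Basic("Control alarm failure"), Basic("Operator error")])
top = Gate("Tank overflow", "AND", [no_surge_control, no_alarm_control])

# A surge problem combined with an alarm-side problem triggers the top event.
print(occurs(top, {"Excess inflow": True, "Operator error": True}))
# Two surge problems alone do not, because the alarm branch is still intact.
print(occurs(top, {"Excess inflow": True, "Low surge outflow": True}))
```

Keeping basic events and gates as separate types mirrors the distinction between circles and rectangular boxes in the fault-tree diagram.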
Step 3 Analysing the Fault Tree
The analysis ‘solves’ the fault tree by identifying combinations of failures that can lead to accidents. These are called minimal cut sets (MCS). The minimal cut sets for the example shown in Fig. 5.5 would be:
• ‘No surge control’ and ‘No alarm control’
• ‘Excess inflow’ and ‘Alarm failure’
• ‘Excess inflow’ and ‘Operator error’
• ‘Low surge outflow’ and ‘Alarm failure’
• ‘Low surge outflow’ and ‘Operator error’.
If each of the control valves (CV1 and CV2) can be in either failure mode (i.e. failed closed or failed open), then further low-level cut sets can be defined, and the fault tree needs to be modified (additional rectangular boxes above each CV circular box) to include the failed states:
• ‘CV1 fails open’ and ‘Alarm failure’
• ‘CV1 fails closed’ and ‘Alarm failure’
• ‘CV2 fails open’ and ‘Alarm failure’
• ‘CV2 fails closed’ and ‘Alarm failure’.
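Cut sets of basic events can be derived mechanically by expanding the tree from the top down: an OR gate contributes the cut sets of each input separately, while an AND gate combines one cut set from every input. The Python sketch below is a minimal illustration of that idea, not a production algorithm. The nested-tuple encoding and the function names are hypothetical, and the tree layout is assumed from the two-level cut sets listed above. The result contains only combinations of basic events, so the intermediate-level pair (‘No surge control’ and ‘No alarm control’) does not appear in it.

```python
from itertools import product

# Assumed tree: (gate type, inputs); a plain string is a basic event.
TREE = ("AND", [
    ("OR", ["Excess inflow", "Low surge outflow"]),
    ("OR", ["Alarm failure", "Operator error"]),
])


def cut_sets(node):
    """Return all cut sets (as frozensets of basic events) for a node."""
    if isinstance(node, str):
        return [frozenset([node])]
    kind, children = node
    child_sets = [cut_sets(child) for child in children]
    if kind == "OR":
        # Any single input is enough: pool the cut sets of all inputs.
        return [cs for sets in child_sets for cs in sets]
    if kind == "AND":
        # Every input is needed: merge one cut set from each input.
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate type: {kind}")


def minimal(sets):
    """Drop duplicates and any cut set that strictly contains another one."""
    unique = set(sets)
    return [s for s in unique if not any(other < s for other in unique)]


mcs = sorted(sorted(s) for s in minimal(cut_sets(TREE)))
for events in mcs:
    print(" AND ".join(events))
```

For this small tree every cut set is already minimal; the `minimal` step matters once the same basic event appears under more than one gate.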
Failure probabilities can now be assigned. The probabilities that are allocated to the events can be combined to estimate the probability of the top event. The probability of two statistically independent events, the one with probability p1 and the other with probability p2, occurring together is:
P(AND) = p1 × p2
and q1 and q2 are the complements of p1 and p2 respectively:
q1 = 1 − p1
q2 = 1 − p2
Then q1 is ‘NOT p1’ and q2 is ‘NOT p2’.
The probability of event 1 not occurring is thus q1, and the probability of event 2 not occurring is q2. Thus, for event 1 OR event 2 to occur, the probability that at least one of the two occurs is the complement of the probability that neither occurs, and is given by the following expression:
P(OR) = 1 − (q1 × q2) = 1 − [(1 − p1) × (1 − p2)]
The concept of this expression can be clarified by the following example. In Fig. 5.5, the probabilities of the equipment failures in the circles are derived from expert judgement, and those of the activities in the rectangular boxes are calculated from frequencies further down the tree.
The probability for no surge control is calculated as:
P(OR) valves = 1 − [(1 − 0.025) × (1 − 0.025)]
= 0.049375 ≈ 0.050
The probability for no alarm control is calculated as:
P(OR) alarm = 1 − [(1 − 0.025) × (1 − 0.052)]
= 0.0757 ≈ 0.075
The probability for the top event shown in the figure (tank overflow) is:
P(AND) tank = 0.050 × 0.075
= 0.00375
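The same calculation can be reproduced with two small helper functions. This is a sketch under the assumption that the input events are statistically independent; the function names `p_or` and `p_and` are hypothetical, and the input probabilities are those quoted in the example.

```python
from math import prod


def p_or(*probabilities):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def p_and(*probabilities):
    """Probability that all of several independent events occur together."""
    return prod(probabilities)


p_valves = p_or(0.025, 0.025)      # no surge control
p_alarm = p_or(0.025, 0.052)       # no alarm control
p_tank = p_and(p_valves, p_alarm)  # top event: tank overflow

print(f"no surge control: {p_valves:.6f}")
print(f"no alarm control: {p_alarm:.6f}")
print(f"tank overflow:    {p_tank:.6f}")
```

Carrying the unrounded intermediate values up the tree gives a top-event probability of about 0.00374, marginally below the 0.00375 obtained from the rounded figures.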
Although the example is hypothetical, it closely resembles a real-world scenario. It is interesting to note that the reliability of the safety alarm control system is lower than that of the surge system it is meant to control. This is due to operator error, where operator judgement is jeopardised by failure in the operator control panel (OCP), which is often the case in many processes. The failure of an item of equipment will result in its replacement, which reduces the failure frequency and then changes the risk probabilities all the way up the tree.
The use of computer models is necessary to keep the fault-tree analysis up to date. It is common in large process plants, however, for the maintenance group not to communicate these improvements to the reliability engineers, who continue to use outdated high-risk numbers. Similarly, experiences of ineffective operation will usually initiate improved training, so that operator errors are less frequent and the reliability of the whole system is improved.
Step 4 Documenting the Results
The analysis should provide a description of the system, a discussion of the problem definition, a list of assumptions, the fault-tree model(s) that were developed, lists of minimal cut sets, and an evaluation of the significance of the MCSs and any recommendations that arise from the FTA.
Probability evaluation of fault trees is considered in most technical papers and books about safety and hazard analysis. However, some approximation discrepancies are evident, especially in the basic theory of assigning probabilities to the fault-tree gates, specifically the OR gate.
The probability expression for statistically independent input events for the OR gate has been given as (Dhillon 1983):
P(OR) = P(a + b + c + etc.) = P(a) + P(b) + P(c) + etc. (5.3)
where a, b, c, etc. = input events
In the example of Fig. 5.5, this is equivalent to:
P(OR) = p3 + p4 or p5 + p6
= 0.050 or 0.077
Considering the complements q3 to q6 of these probabilities, in the same way as q1 and q2 for p1 and p2, results in:
P(OR) = 1 − (q3 × q4) or 1 − (q5 × q6)
= 0.049375 or 0.0757
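The size of this discrepancy is easy to examine numerically. The sketch below compares the additive form with the complement form for the two OR gates of the example, and then for a pair of larger, purely hypothetical probabilities. The function names are assumptions made for illustration, and statistical independence of the inputs is assumed throughout.

```python
from math import prod


def p_or_sum(ps):
    """Additive approximation: simple sum of the input probabilities."""
    return sum(ps)


def p_or_exact(ps):
    """Complement form for independent inputs: 1 minus the product of the q values."""
    return 1.0 - prod(1.0 - p for p in ps)


cases = {
    "no surge control (p3, p4)": (0.025, 0.025),
    "no alarm control (p5, p6)": (0.025, 0.052),
    "hypothetical large inputs": (0.4, 0.5),  # not from the example
}

for label, ps in cases.items():
    approx, exact = p_or_sum(ps), p_or_exact(ps)
    print(f"{label}: sum={approx:.6f} exact={exact:.6f} gap={approx - exact:.6f}")
```

For two inputs the gap equals the product of the two probabilities, so the sum always overstates the result. The overstatement is negligible for rare events and becomes significant as the input probabilities grow.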
5.2.1.2 Root Cause Analysis for Safety in Engineering Design
Root cause analysis is predominantly a technique for determining the origin of causes of failure in engineered installations after completion of their design. However, the approach can also be used to identify potential root causes of failure, particularly failures with critical safety consequences, during the engineering design process before systems manufacture, installation and/or construction. It is imperative that design engineers consider how their designs operate in the field and, more importantly, how they fail, in order to achieve integrity in engineering design. This will ultimately result in engineering designs that satisfy both functional and integrity requirements, using sound engineering judgement rather than ‘crystal ball’ prediction techniques.
Although there is a wealth of knowledge and data concerning systems performance of existing engineered installations, in general this is not utilised to the extent that information may be obtained for use in new designs, especially in complex integrations of designs. To this end, more formal and systematic methods should be introduced during the engineering design process.
Although specific methods and tools are available to facilitate designing for reliability, for example, their use is often limited to reliability engineers, with the design engineers of other disciplines frequently adopting an intuitive approach to considering reliability in design. As the design process becomes increasingly sophisticated, with higher-level design tasks of complex integrations of similarly complex systems, it has become essential that design engineers formally investigate the integrity of these designs, particularly at each interface of the integrated systems.
Examining and understanding the root cause of failure of a design’s functional operation can aid in designing for safety and designing-out unreliability. In selecting equipment from an existing design to meet a new requirement within a different systems integration, it is important that design engineers look beyond the standard reliability metric of the existing design, and review in particular the root causes of failure and significant factors affecting the equipment’s reliability and safety. In the past, there has been an over-reliance on the use of prediction methods. For example, the original reliability prediction handbook of the US Department of Defense (DoD), MIL-HDBK-217, contained failure rate models for the various part types used in electronic systems, and concentrated mainly on the use of prediction methods that did not provide engineers with any knowledge of what might fail in service (MIL-HDBK-217F 1998).
A methodology aimed at integrating reliability enhancement practices into the engineering design process has been developed as part of a UK government and aerospace industry initiative. The Reliability Enhancement Methodology and Modelling (REMM) project was funded in part by the UK’s Department of Trade and Industry through the Civil Aviation Research and Development program, and by the industrial partners involved (Marshall et al. 1998). The main objectives of the project are to develop a methodology that supports reliability enhancement in engineering design and to develop a model that facilitates reliability assessment throughout a system’s life cycle. REMM is primarily used within the aerospace environment, but the methodology and model developed are equally applicable to other high-reliability system designs, such as in process, chemical and mechanical engineering design projects. A number of simple practical analyses for use by design engineers, during the early stages of systems realisation, have been developed as part of the REMM methodology. These analyses are aimed at improving high-level decision-making using simple graphical representations of reliability data, such as analyses of root causes, trends, and manufacturing data.
These graphical representation analyses include:
• Root cause analysis and classification of events into high-level failure categories, providing the means to determine those factors that have most effect on the system’s service reliability and, hence, which elements should be tackled as a priority.
• Root cause and trend data across specific criteria such as equipment type, periods of time (e.g. particular manufacturing time-line points), application or use, providing further understanding of the nature of the failure that may be characteristic of the environment in which it is operating.
• Manufacturing data analysis, providing valuable insight into the factors that affect service reliability. Correlation between manufacturing methods and service requirements can often illuminate small changes in design and manufacturing process that result in significant effects on service reliability.
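The first of these analyses, classifying events into high-level failure categories and ranking them, amounts to a simple frequency count. The Python sketch below is illustrative only: the failure records, the category names and the function `rank_categories` are all hypothetical and are not taken from the REMM project.

```python
from collections import Counter

# Hypothetical failure records: (equipment type, root cause category).
records = [
    ("pump", "design"),
    ("pump", "manufacturing"),
    ("valve", "design"),
    ("valve", "operation"),
    ("pump", "design"),
    ("sensor", "manufacturing"),
]


def rank_categories(records):
    """Rank root cause categories by count, with the cumulative share of all events."""
    counts = Counter(category for _, category in records)
    total = sum(counts.values())
    ranked, cumulative = [], 0
    for category, n in counts.most_common():
        cumulative += n
        ranked.append((category, n, cumulative / total))
    return ranked


for category, n, share in rank_categories(records):
    print(f"{category:<14}{n:>3}{share:>8.0%}")
```

Grouping the same records by equipment type or by time period instead of category gives the trend view described in the second item.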
Root cause analysis also utilises the deductive logic tree approach, similar to fault-tree analysis (FTA), in establishing the root causes of functional failure or of a system state. Such an approach to problem solving is particularly useful for determining safety in engineering designs.
The approach of establishing the root causes of functional failure in systems design is intended to achieve the following:
• Organise and control design integrity problem identification.
• Provide a visual checklist to ensure all pertinent areas are covered.
• Allow for a standardised approach to safety problem identification.
• Serve as a documented guide for design integrity problem reviews.
The most common root cause analysis methods cover topics from events and causal factor analysis to change analysis, barrier analysis, management oversight and risk assessment, human performance evaluation, standard problem solving and basic decision-making.
These methods are considered in the common root cause analysis approach developed by the Office of Nuclear Energy, US Department of Energy, in their guideline ‘Root cause analysis guidance document’ (DOE-NE-STD-1004-92 1992).
Common Root Cause Analysis Methods
• Events and causal factor analysis identifies the time sequence of a series of tasks and/or actions and the surrounding conditions that can lead to a failure occurrence. The results are displayed in an events and causal factor chart that gives a picture of the relationships of the events and causal factors.
• Change analysis is used when the problem is obscure. It is a systematic process that is generally used for a single failure occurrence and focuses on elements that change.
• Barrier analysis is a systematic process that can be used to identify physical and procedural barriers or controls that should prevent the occurrence of failure.
• Management oversight and risk tree (MORT) analysis is used to identify inadequacies in barriers/controls, specific barrier and support functions, as well as management functions. It identifies specific factors relating to a possible failure occurrence and identifies the underlying factors that permit these factors to exist.
• Human performance evaluation identifies those factors that influence task performance. The focus of this analysis method is on operability, work environment, and management factors, as well as man-machine interface studies to improve performance.
• Problem solving and decision-making provides a systematic framework for gathering, organising and evaluating information, and applies to all phases of a possible failure occurrence investigation (Kepner et al. 1981).
By organising problem analysis results in an orderly manner as the design progresses, the time spent to find the root causes of possible problems is minimised. The method consists of using factor trees to guide the course of the analysis. Factor trees diagrammatically present the major areas to be considered in the various stages of an engineering design project, such as:
• Systems and equipment design.
• Manufacturing and installation.
• Process start-up and ramp-up.
• Operations and maintenance.
To conduct a root cause analysis specifically in the systems and equipment design stage, a series of charts can be developed representing those functional areas to be investigated, and the various factors to be considered when investigating the functional areas for causes of potential failure problems. These root cause factors for the systems and equipment design area include the following:
• Origin of design criteria.
• Utility inputs prior to design.
• Equipment specifications.
• Constraints on the design.
• Actual design solution and test.