Process and Equipment Reliability Paper
Sponsored by
Maintenance Technology, http://www.mt-online.com
Reliabilityweb.com, http://www.reliabilityweb.com
Process and Equipment Reliability
Paul Barringer, P.E.
Barringer & Associates, Inc., P. O. Box 3985, Humble, TX 77347-3985
Phone: 281-852-6810, FAX: 281-852-3749, e-mail: hpaul@barringer1.com
Abstract
Reliability for businesses begins with management and how they communicate the need for a
failure free environment to mobilize actions to preserve operable systems and processes. The need for
reliability considers cost of alternatives to prevent or mitigate failures, which require knowledge about
times to failure, and failure modes which are found by reliability technology. Justifications for
reliability improvements require knowing: 1) when things fail, 2) how things fail, and 3) conversions
of failures into time and money. Reliability engineering principles help define when and how things
fail to provide facts for life cycle costs comparisons. This helps decide the lowest long-term cost of
ownership driven by a single estimator called net present value for converting hardware issues and
alternatives into money issues. Several short examples illustrate the methodology.
Reliability
Processes, components, equipment, systems, and people are not perfect and not free from
failures. In a naive, simplistic, and deterministic view, we can have perfection with perfect reliability.
In the real world we fall short of perfection (perfection exists only in a fantasy world). Everything
fails either because of events or from aging deterioration. A natural law of entropy expresses the
lowest energy state as a failure: buildings always fall down, they never fall up, which means we must
continually maintain processes and equipment to prevent disorder and failures. This requires spending
time and resources on mitigating failure effects, as nothing lasts forever.
Reliability is the probability that a component, system, or process will function without
failure for a specified length of time when operated correctly under specified conditions.
Reliability engineering is a strategic task concerned with predicting and avoiding failures. For
quantifying reliability issues it is important to know why, how, how often, and costs of failures.
Reliability issues are bound to the physics of failure mechanisms so the failure mechanisms can
be mitigated. In the real world all potential failures are seldom well known or well understood
which makes failure prediction a probabilistic issue for reliability analysis.
Reliability is not the same as availability although both are described as a value between 0
and 1. Availability tells the percent of time the system is alive and ready for use if called upon,
and stream factors define the actual online times as a percentage of up time. Reliability addresses
the probability for a failure free interval under specific conditions. Reliability is the sweet
absence of failures. Unreliability is the sour presence of failures which cost money.
Risk assessment models connect money with failures in a simple equation:
$Risk = (probability of failure during a specified time interval and under specific
conditions)*($Consequence of the failure event). $Risks always exist, and they are never zero.
How much $Risk is affordable becomes a business issue. If the business organization is risk
averse, then perhaps $Risk values must be less than say $10,000, or if risk accepting, then less
than say $100,000. Set the actual $Risk value as a business decision rather than backing into it
by failure to make a decision. Society expects planning for success and rejection of abnormal
$Risks. In the end, reliability is all about money for industrial businesses.
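The $Risk equation above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the 2% probability, and the $400,000 consequence are hypothetical, and the $10,000 limit is simply the risk-averse example figure quoted in the text.

```python
def dollar_risk(probability_of_failure: float, consequence_dollars: float) -> float:
    """$Risk = (probability of failure in the interval) * ($ consequence)."""
    if not 0.0 <= probability_of_failure <= 1.0:
        raise ValueError("probability_of_failure must be between 0 and 1")
    return probability_of_failure * consequence_dollars


# Hypothetical event: 2% chance of failure during the interval,
# $400,000 consequence if the failure occurs.
risk = dollar_risk(0.02, 400_000)
risk_averse_limit = 10_000  # example threshold from the text
print(f"$Risk = ${risk:,.0f}; within risk-averse limit: {risk <= risk_averse_limit}")
```

The comparison against a limit is where the business decision enters; the arithmetic itself does not decide how much $Risk is affordable.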
Maintenance & Reliability Technology Summit, Chicago, IL, May 25, 2004
What does your reliability policy say?
Management presents important issues to the organization with policy statements. Policies
define specific areas of concern and indicate the desired outcome. Policies increase decisiveness
by removing uncertainty about action required to meet the objective. Policy statements
communicate information to the staff in general terms for detailed implementation by procedures
in a consistent fashion through individual acceptance and individual commitment. Good policies
reduce the potential for bad events such as inefficiency, counter productivity, inappropriate risk
taking, and conflicts over requirements so that nothing is implemented because of the void.
Modern organizations have safety policies, quality policies, and environmental policies.
These existing policies have clear and simple statements calling for zero lost time injuries, zero
shipments of defective products, and zero environmental releases. These are reliability statements
in specific areas where they indicate the absence of failures. Reliability issues for processes and
equipment also need a clear and concise policy statement to avoid fuzzy interpretations.
For your policy, can you consider saying: "We will build and maintain an economical and
failure free process which will operate for 5 years between planned outages." Management
must set the reliability policy (the general directive for what is intended). Policy drives
procedures (step-by-step instructions for implementing policies). Procedures drive the rules
(statements to take or not take actions).
Reliability policies must integrate safety, quality, risk, and financial requirements for the
company to achieve the business objectives. Reliability policies must be understandable to the
common person and come from top levels of management for credibility, legitimacy, constancy
of purpose for improvements, and setting the organization to work for a common objective.
Management has a big role in reliability issues, which guide the design of equipment and
continue through maintenance of equipment and systems. Management must address the issues
and state the general requirements so everyone understands. The issue of reliability is to provide
a failure free environment for the process while expecting loss of components and equipment.
Management must think in terms of a chess game strategy: pawns will be lost (pumps, valves,
instrumentation), but don't lose the king, which is the process, including pressure integrity and
the product delivery system protecting human safety and the environment.
Management must also set an environment so everyone knows the cost of failures. It can't
be a secret for those who need to make financial decisions, and procedures must be established to
communicate assumed or calculated costs so the organization understands the high cost of
certain failures. The fact that a human life is priceless does not
compute, but society allows certain risks, which then allow calculated values for communication
purposes, as time/cost is the language of commerce, decisions, and action. Examples could be:
Assumed values-
Spill loss of 1 gallon of undesirable fluids = $500
Spill loss of 100 gallons of undesirable fluids = $3,000
Spill loss of 1,000 gallons of undesirable fluids = $250,000
Spill loss of 10,000 gallons of undesirable fluids = $1,900,000
Spill loss of 100,000 gallons of undesirable fluids = $20,000,000
Violation of the Clean Water Act carries civil penalties of $25,000/day or $1,000/bbl
of spilled material, or $3,000/bbl for gross negligence.
Calculated values -
Accidental death of one person at (probability of failure = 10^-4) = $2,500,000,
multiply by 4 for disablement
Accidental death of 10 people at (pof = 10^-6) = $250,000,000, ditto for disablement
Assumed values only for communication purposes-
Other items to provide guidelines for failure events or failure avoidance including
lists of applicable specifications and regulations for compliance audits.
If the events occur in a sensitive area increase the cost by a factor of 5, if the events
occur in a benign environment, reduce the cost by a factor of 5.
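The assumed spill values and the factor-of-5 adjustment can be held in a small lookup so everyone in the organization prices an event the same way. This is a sketch only: the function and dictionary names are hypothetical, the "normal" area label is an assumption added for the unadjusted case, and only the spill sizes listed above are covered (no interpolation is attempted).

```python
# Assumed spill costs from the list above: gallons -> dollars.
ASSUMED_SPILL_COST = {
    1: 500,
    100: 3_000,
    1_000: 250_000,
    10_000: 1_900_000,
    100_000: 20_000_000,
}

# Sensitive areas raise the cost by a factor of 5; benign areas reduce it by 5.
# "normal" is an assumed label for the unadjusted case.
AREA_FACTOR = {"sensitive": 5.0, "normal": 1.0, "benign": 0.2}


def spill_cost(gallons: int, area: str = "normal") -> float:
    """Return the assumed communication value for a listed spill size."""
    return ASSUMED_SPILL_COST[gallons] * AREA_FACTOR[area]


for area in ("benign", "normal", "sensitive"):
    print(f"1,000 gallon spill, {area} area: ${spill_cost(1_000, area):,.0f}")
```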
Please note: These numbers are quantified merely to convert humanitarian and violation issues
into money (the language of business) so business trade-off decisions can be made, and the values
are not intended to be guidance values for lawyers, nor do they represent callous viewpoints
about the value of human life.
The point for inclusion of failure details into the procedures is to convert failure issues into
money as a guideline for the organization to think about tradeoffs so everyone can react in a
logical fashion to make honest and unemotional decisions concerning reliability matters. The
probability of failure for one person at 10^-4 (Taylor 1994) is the same as reliability = 0.9999, and
10^-6 is reliability = 0.999999. Reliability values for discussions about humans are more palatable
and less emotional than probabilities for human failure! Also note the $Risk for humans is set at a
lower allowed value than the $Risk for other typical business events.
Reliability Models
Reliability engineers discuss reliability and availability issues using models in which a challenge
is balanced against a capacity for handling the challenge of adverse conditions (Modarres
1999). Some systems and processes have a specific inherent capacity to handle challenges. In
general, when the challenge is less than the capacity we have success; and when not, then we
have failure. The failure mechanisms are physical processes attacking the models.
Typical reliability models are:
Stress-Strength models, where stress represents the aggregation of challenges and external
conditions, and strength represents the variability of conditions affecting the capacity of the system to
repel the challenges attempting to cause failure.
Damage-Endurance models are similar to stress-strength models, but the stress causes
damage that is irreversible and cumulative, such as corrosion, wear, embrittlement, and fatigue.
Failure occurs when the cumulative damage exceeds the capacity of the component or system
to cope with the insults.
Challenge-Response models are conditions where components or systems fail but the failure
is not identified until a challenge occurs due to a critical event, as compared to typical situations
such as those that occur from the passage of time/cycles, etc.
Tolerance-Requirements models have failures only when the performance characteristics fall
outside of some predetermined fixed limits and gradual degradation occurs based on use or time.
Mechanical failure mechanisms are physical processes, which lead to or result from some
sort of stress (think of stress in general terms). The three broad classes are:
Stress-induced failure mechanisms are the cause or result of permanent or temporary stresses and
are categorized by brittle fracture, buckling, yield, impact, ductile fracture, or elastic deformation.
Strength-reduced failure mechanisms lead directly or indirectly to failure and are
categorized by wear, corrosion, cracking, diffusion, creep, radiation damage, or fretting.
Stress-increased failure mechanisms have a direct effect on increasing the applied stress and
are categorized by fatigue, radiation, thermal-shock, impact, and fretting.
Electrical failure mechanisms are often more complicated than those of the typical mechanical system
because of the devices and the packages in which they reside. Thus electrical/electronic failure
mechanisms are often divided into three broad classes:
Electrical stress mechanisms result from voltage levels that damage devices or degrade
electrical/electronic characteristics and are categorized by human error, uncontrolled currents,
uncontrolled voltages, localized heating/melting, and latent damage, resulting in later failures.
Intrinsic failure mechanisms are related to the electrical/electronic element and categorized
by electrically active layers in semiconductor chips, infant mortality problems built into the
devices by manufacturing/design problems, gate oxide breakdown, ionic contamination, surface
charge spreading, and hot electrons.
Extrinsic failure mechanisms are external failure mechanisms stemming from device
problems and problems with interconnections and undesirable environmental conditions.
Each failure mechanism can be accelerated in the field by unexpected combinations of
events. The events are often triggered by a situation acting as a catalyst for failure.
The age-old technique for coping with these challenge events has been to make the
components, systems, and processes very strong and keep the loads very low. For example,
Roman bridges built for horses and chariots now carry heavy trucks with great success.
Another method of mitigating failures is by maintenance engineering techniques; however,
maintenance cannot restore strengths which never existed in the original design. Thus direct
replacement maintenance efforts cannot improve inherent reliability, but only restore it to the
original values. Upgrade replacements, however, can provide greater capacity.
Table 1: Short List Of Reliability Engineering Principles Tools
    Mean time between failures indices
    TPM and reliability principles
    Preparing reliability data for analysis
    Decision trees merging reliability and costs
    Weibull, normal, & log-normal probability plots
    Corrective action for Weibull failure
    Models & Monte Carlo simulations
    Pareto distributions for vital problems
    Fault tree analysis
    Design review
    Load/strength interactions
    Software reliability tools
    Sudden death and simultaneous testing
    Failure recording, analysis and corrective action
    Failure mode effect analysis
    Bathtub curves for modes of failure
    Availability, maintainability, capability
    Critical items significantly affecting safety/costs
    Quality function deployment
    Mechanical components testing for interactions
    Electronic device screening and de-rating
    Reliability testing strategies
    Accelerated testing
    Contracting for reliability
    Reliability growth models and displays
    Cost of unreliability
    Reliability policies and specifications
    Reliability audits
    Management's role in reliability improvements
The three regimes for equipment failure are: 1) infant mortality, 2) chance failures, and 3)
wear out failures, which are connected to failure rates. Human errors are usually chance failures.
Infant mortality and old age wear out failures are superimposed on chance failures to obtain
the typical bathtub failure curve where we typically think of chance failures having a lower
failure rate than either wear out failures or infant mortality failures. The idealized bathtub curve
is seldom observed for equipment: we have fewer pieces of equipment than we have human
failures (deaths), and all human deaths in civilized societies must be reported to government
agencies (reporting of equipment failures is not mandatory).
The death of most equipment must be analyzed from small samples using Weibull analysis
(Abernethy 2000) for each failure mode, and software makes the analysis task easy (Fulton 2004). In
many cases a simple arithmetic technique of MTBF or MTTF is used as a reliability
precursor for the mixtures of failure modes that occur.
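In Weibull analysis the three failure regimes correspond to the shape parameter: a shape value below 1 gives a falling failure rate (infant mortality), a value of 1 gives a constant rate (chance failures), and a value above 1 gives a rising rate (wear out). The sketch below uses the standard two-parameter Weibull formulas; the characteristic life and the three shape values are hypothetical and chosen only to show the trend.

```python
import math


def weibull_reliability(t: float, beta: float, eta: float) -> float:
    """R(t) = exp(-(t/eta)**beta) for a two-parameter Weibull."""
    return math.exp(-((t / eta) ** beta))


def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Instantaneous failure rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)


eta = 1000.0  # hypothetical characteristic life, hours
for label, beta in [("infant mortality", 0.5), ("chance", 1.0), ("wear out", 3.0)]:
    early = weibull_hazard(100.0, beta, eta)
    late = weibull_hazard(900.0, beta, eta)
    trend = "falling" if late < early else "rising" if late > early else "constant"
    print(f"{label:16s} beta={beta}: failure rate is {trend}")
```

Fitting the shape and characteristic life from failure data, including suspensions, is the job of the Weibull software cited in the text and is not attempted here.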
Most reliability tools are practical engineering tools seldom studied in depth at most universities.
Usually the tools must be learned as supplements of continuing education either by home study or by
short courses; see the reading list (Barringer 2004a).
A simple precursor of reliability is a criterion called mean time to failure for non-repairable
items and mean time between failures for repairable items. Consider the seal MTTF details in
Figure 1 showing the effects of a change in failure criteria caused by new regulations.
Figure 1: Seal MTTF (years) versus year, 1984 to 1994, before and after emission monitoring
(vertical axis 0.00 to 2.50 years). This MTTF data is from production, maintenance, and purchasing
records. A note on the chart reads: "... run one-half of the time for determining the number of
operating hours." Remember: MTTF is a yardstick, not a micrometer!
The MTTF decline in Figure 1 results from more severe criteria for failure. MTTF/MTBF
indicators help forecast the number of pump repairs expected during a time interval and thus help plan
maintenance demands on resources and costs.
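As a rough yardstick, the expected number of repairs in an interval is the number of operating units multiplied by the interval and divided by the MTBF. The sketch below assumes every unit runs for the whole interval and that the failure rate is steady; the function name, pump count, MTBF, and repair cost are hypothetical.

```python
def expected_repairs(n_units: int, interval_years: float, mtbf_years: float) -> float:
    """Forecast repairs in an interval, assuming a steady failure rate."""
    return n_units * interval_years / mtbf_years


# Hypothetical plant: 200 pumps, one-year budget period, MTBF of 2.5 years.
repairs = expected_repairs(n_units=200, interval_years=1.0, mtbf_years=2.5)
cost_per_repair = 4_000  # hypothetical average repair cost, dollars
print(f"Expected repairs: {repairs:.0f}, maintenance budget: ${repairs * cost_per_repair:,.0f}")
```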
Some systems are simple series models without redundancy as shown in Figure 2 where failure
of a single device causes the entire system to fail.
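Block diagram: elements 1, 2, and 3 connected in series.

\[ \lambda_T = \sum_{i=1}^{n} \lambda_i = \lambda_1 + \lambda_2 + \lambda_3 \]
\[ R_1 = e^{-\lambda_1 t}, \quad R_2 = e^{-\lambda_2 t}, \quad R_3 = e^{-\lambda_3 t} \]
\[ R = \prod_{i=1}^{n} R_i = R_1 \cdot R_2 \cdot R_3 = e^{-\lambda_T t} \]

Chart: system reliability versus individual component reliability (0.95 to 1) for N = 5, 10, 25,
50, and 100 elements in series.
Figure 2: Simple Series System Reliability Model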
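The series relationship (system reliability is the product of the element reliabilities, and with constant failure rates the rates simply add) can be checked numerically. This is a minimal sketch with hypothetical function names; the 0.99 component reliability is an arbitrary point on the horizontal axis of Figure 2.

```python
import math


def series_reliability(component_reliabilities) -> float:
    """System reliability for elements in series: the product of the parts."""
    return math.prod(component_reliabilities)


def series_reliability_from_rates(failure_rates, t: float) -> float:
    """Same result from constant failure rates: R = exp(-(sum of rates) * t)."""
    return math.exp(-sum(failure_rates) * t)


# Effect of the number of series elements at 0.99 individual reliability.
for n in (5, 10, 25, 50, 100):
    print(f"N = {n:3d}: system reliability = {series_reliability([0.99] * n):.3f}")
```

Even very reliable components give a poor system when many of them are placed in series, which is the message of the falling curves in Figure 2.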
Other systems are more durable by use of redundancies as shown in Figure 3. Redundancy means
two or more devices providing backups to reduce system risks for failure.
Figure 3: System reliability with redundancy versus individual component reliability (0.95 to 1)
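For redundant devices, a common simplification is that the system fails only when every redundant device fails, so system reliability is one minus the product of the individual unreliabilities. The sketch below assumes identical devices, independent failures, and active (parallel) redundancy; those assumptions and the function name are not taken from the paper.

```python
def parallel_reliability(component_reliability: float, n_redundant: int) -> float:
    """System works if at least one of n identical, independent devices works."""
    return 1.0 - (1.0 - component_reliability) ** n_redundant


for n in (1, 2, 3):
    print(f"{n} device(s) at 0.95 each: system reliability = "
          f"{parallel_reliability(0.95, n):.6f}")
```

Common cause failures and switching failures erode this benefit in practice, so treat the result as an upper bound.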
Financial failure risk is reduced for dual or triple (or more) redundancy at the expense of more
investment. Financial outcomes can be calculated with standard NPV calculation sheets, as the
sustaining cost is known for each year and the acquisition cost for each scenario can be estimated.
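The NPV comparison between scenarios can be sketched as follows. The discount rate, acquisition costs, and yearly sustaining costs are hypothetical figures chosen only to show the mechanics; costs are entered as negative cash flows, so the least negative NPV is the preferred alternative.

```python
def npv(discount_rate: float, cash_flows) -> float:
    """Net present value; cash_flows[0] is at year 0, cash_flows[k] at end of year k."""
    return sum(cf / (1.0 + discount_rate) ** year
               for year, cf in enumerate(cash_flows))


# Hypothetical scenarios: acquisition cost at year 0, then five years of
# sustaining cost (maintenance plus the cost of failures).
scenarios = {
    "single pump": [-50_000] + [-30_000] * 5,
    "dual redundant pumps": [-100_000] + [-12_000] * 5,
}
rate = 0.12  # hypothetical discount rate
for name, flows in scenarios.items():
    print(f"{name}: NPV = ${npv(rate, flows):,.0f}")

best = max(scenarios, key=lambda name: npv(rate, scenarios[name]))
print(f"Lowest long-term cost of ownership: {best}")
```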
For those needing modeling/calculation assistance, the no-cost RAPTOR block diagram
software developed by the US Air Force is available for building complex reliability models
(Barringer 2004b). Data is required to drive all reliability models and a limited Weibull database
is available (Barringer 2004c), and data for reliability models should come from your own failure
database which reflects: 1) grade of equipment purchased, 2) maintenance/operating strategies
employed, and 3) organization of the database to clean the data using good practices such as
inclusion of suspended (censored) data, and identification of the data taxonomy (CCPS 1998).
An area of poor reliability lies in problems associated with humans for rapid response when
monitoring, controlling, and maintaining systems. The roots of failures, when broken into a
Pareto distribution, are: 38% for humans, 34% for procedures/processes, and 28% for equipment.
Human error probabilities = (number of errors)/(number of opportunities for error) and human
error rates = (number of errors)/(total task duration). Table 2 shows the human probability of
failure in a control room to correctly diagnose an abnormal event; reliabilities are obtained by
taking the complement of probabilities of failure (AIChE 2000). This implies that automation of
control room functions is very important for improving system reliability. The popular, but
erroneous, concept is that humans are reliable and equipment is unreliable, which leads to
overemphasizing hardware faults and underemphasizing human faults. Human unreliability is
often the dominant factor in unreliability issues.

Table 2: Time Available For Diagnosis Of An Abnormal Event After Control Room Annunciation
    Time (minutes)    Probability Of Failure (%)
    1                 ~100
    10                50
    20                10
    30                1
    60                0.1
    1500              0.01
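The two human error ratios and the complement rule for Table 2 can be written out directly. This is a sketch with hypothetical function names and a made-up count of errors; the Table 2 entry of "~100" for one minute is treated as 100% here, which is an approximation.

```python
def human_error_probability(errors: int, opportunities: int) -> float:
    """(number of errors) / (number of opportunities for error)."""
    return errors / opportunities


def human_error_rate(errors: int, total_task_duration: float) -> float:
    """(number of errors) / (total task duration)."""
    return errors / total_task_duration


# Table 2: minutes available for diagnosis -> probability of failure, percent.
TABLE_2 = {1: 100.0, 10: 50.0, 20: 10.0, 30: 1.0, 60: 0.1, 1500: 0.01}

for minutes, pof_percent in TABLE_2.items():
    reliability = 1.0 - pof_percent / 100.0  # complement of probability of failure
    print(f"{minutes:5d} min: human reliability = {reliability:.4f}")

# Hypothetical example: 3 errors in 1,000 opportunities.
print(f"Human error probability = {human_error_probability(3, 1000):.3f}")
```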
To an engineer, most issues concerning reliability are related to things (not people): things
that happen and things that mitigate failures. To managers, most issues concerning reliability are
related to costs, particularly the costs for loss of the process. This requires converting things
into money, which drives the need for alternatives. Seldom is the best engineering idea
implemented because in the real world money is scarce, and the first cost is rarely the last cost as
things need constant attention and maintenance, which ties all issues to life cycle costs and net
present value calculations for the time value of money.
Failure data and repair time data can be converted into statistical format using WinSMITH
Weibull software for use in reliability calculations (Fulton 2004).
Few individuals claim knowledge of sustaining cost facts until someone else puts numbers
on the table; then the critics are numerous for correcting the proposed numbers. Follow the
scientific method: build a hypothesis for failures and their cost and then test the hypothesis.
When in doubt about the failure data or cost, make an estimate and test the estimate for validity.
Much data needed for LCC comes from operating costs (including electricity, etc.) and
maintenance records which show times between failure and repair times. These details are often
associated with the field of reliability and maintainability with a direct relationship for finding lower
life cycle costs. The cost details should also include costs for lost gross margin for outages of systems
when it is appropriate. Some of the failure data is from simple arithmetic calculations and other data
follows the preferred method from Weibull databases.
Conditions for installation, operation, and maintenance influence both failures and failure costs,
which are susceptible to equipment grades for changing the financial performance. Often Monte
Carlo computer simulations, using random numbers, are required to find cost variability for different
equipment grades.
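A Monte Carlo run for cost variability can be sketched with nothing more than a random number generator. The example below assumes exponentially distributed times between failures (chance failures) and instantaneous repair, both simplifying assumptions; the equipment grades, MTBF values, and failure costs are hypothetical.

```python
import random
import statistics


def simulate_annual_failure_costs(mtbf_years, cost_per_failure,
                                  n_trials=10_000, seed=1):
    """Return one simulated failure cost per trial year."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    costs = []
    for _ in range(n_trials):
        elapsed = rng.expovariate(1.0 / mtbf_years)  # time of first failure
        failures = 0
        while elapsed <= 1.0:  # count failures inside the one-year window
            failures += 1
            elapsed += rng.expovariate(1.0 / mtbf_years)
        costs.append(failures * cost_per_failure)
    return costs


# Hypothetical grades: (MTBF in years, cost per failure in dollars).
grades = {"commercial grade": (1.5, 20_000), "heavy duty grade": (4.0, 25_000)}
for name, (mtbf, cost) in grades.items():
    costs = sorted(simulate_annual_failure_costs(mtbf, cost))
    p90 = costs[int(0.9 * len(costs))]
    print(f"{name}: mean ${statistics.mean(costs):,.0f}/yr, "
          f"90th percentile ${p90:,.0f}/yr")
```

The spread between the mean and the upper percentile is the cost variability that a single arithmetic average hides.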
You can build simple, low cost Monte Carlo reliability models using software available
from the Internet, which is useful for driving life cycle cost decisions (Barringer 2004e). The
reason for building reliability models is to find where failure cost is occurring and to search for
the lowest long-term cost of ownership where system details, when priced-out, provide a clear
leading alternative for solving the problems. The reliability models show what's affordable and
identify the less desirable alternatives.
Reliability models, using actual failure data and repair times, give system availability, reliability,
maintainability, and other operating system details which allow construction of costs and tradeoffs. A
clear definition of failure is important for reliability decisions and the data acquired.
If you do not have data collection systems for your failures, you will seldom make substantial
improvements in the reliability of your systems.
Reliability models provide evidence for tradeoff boxes. Engineers need graphics for
understanding what's happening to their systems. The tradeoff box has life cycle cost on the vertical
axis and effectiveness on the horizontal axis. Effectiveness is the product of availability, reliability,
maintainability, and capability of the system to perform. Complex items become simple when you
see the results shown in Figure 4. The left hand of Figure 4 symbolizes too many failures and the
right hand of Figure 4 symbolizes too much equipment. The sweet spot lies between the extremes.
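Figure 4: Tradeoff box with life cycle cost on the vertical axis and effectiveness on the horizontal
axis. Left-hand label: "The High Cost Of Large Equipment, Too Many Outages And Too Few Run
Hours". Right-hand label: "The High Cost Of Small Equipment With Too Many Redundancies And
Long Run Hours".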
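The effectiveness coordinate for the tradeoff box is the product of the four factors named above, each expressed as a value between 0 and 1. The sketch below tabulates a few alternatives; the alternative names, factor values, and life cycle costs are hypothetical.

```python
def effectiveness(availability: float, reliability: float,
                  maintainability: float, capability: float) -> float:
    """Effectiveness = availability * reliability * maintainability * capability."""
    return availability * reliability * maintainability * capability


# Hypothetical alternatives: (A, R, M, C) and life cycle cost in dollars.
alternatives = {
    "do nothing": ((0.90, 0.60, 0.85, 0.90), 9_000_000),
    "upgrade seals": ((0.95, 0.80, 0.90, 0.95), 7_500_000),
    "triple redundancy": ((0.99, 0.97, 0.90, 0.95), 11_000_000),
}
for name, (factors, lcc) in alternatives.items():
    print(f"{name:18s} effectiveness = {effectiveness(*factors):.3f}, LCC = ${lcc:,.0f}")
```

Plotting each alternative at (effectiveness, LCC) gives the points for the tradeoff box.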
What most companies need is the money and control of risk; they rarely need perfect
solutions! Life cycle cost helps provide the answers when driven by the tools of reliability
engineering. When you have concepts and features on a product or process that generate value,
the value must be quantified for inclusion in the life cycle cost model.
Summary
A reliability policy focuses the organization toward a failure free environment to meet
management objectives for the business. Reliability engineering tools predict failures and risks
associated with certain actions. Details from the reliability analysis go into life cycle costs models for
merging engineering details into a format considering the time value of money. Life cycle concepts
rely heavily on reliability and maintainability technology issues to convert ideas into hard engineering
facts, which convert issues into monetary values for making trade-off decisions about $Risks.
The first cost for procurement is not the last cost. Procurement cost may represent only a
small fraction of the total cost during the life of an item, and in other cases, it may be a large
portion of the total life cycle costs; general rules of thumb have much variance.
The engineering facts must be converted into financial details of NPV and IRR with a selection
of the best alternative from several courses of action. The decisions you make up front will be with
you for many years, so it's important to justify risks and improvements using the best tools available.
Without reliability details about failures, NPVs are difficult to calculate for making correct
business decisions about tradeoffs with failure avoidance for reliability as a business focus.
If your owners, customers, or the general public are unhappy about reliability of your systems,
improvements must be made; that is a call for change. To get change, you must make a change. If
improvements are not voluntary and community outrage is high, trailing legislation will surely follow
in ways seldom to your advantage. Unreliability must be quantified and resolved quickly for business
purposes. Occasionally reliability begins at low levels in the organization with long incubation times
before recognizable progress is recorded; however, in private enterprise, high system reliability is best
achieved top down with policy statements to quickly put the organization to work for reaching a target
in reasonable time frames.
References
1. Abernethy, Robert B., The New Weibull Handbook, fourth edition, Dr. Robert B. Abernethy
author and publisher, 536 Oyster Road, North Palm Beach, FL 33408-4328, Phone/FAX: 561-
842-4082, e-mail: Weibull@worldnet.att.net, ISBN 0-9653062-1-6, 2000.
2. AIChE, Guidelines For Chemical Process Quantitative Risk Analysis, 2nd edition, American
Institute of Chemical Engineers, New York, ISBN 0-8169-0720-X, 2000
3. Barringer, H. Paul, Reliability Engineering Principles, author and publisher, Barringer &
Associates, Inc., P.O. Box 3985, Humble, TX, 2004. The reading list is also available on the
Internet at http://www.barringer1.com/read.htm
4. Barringer, H. Paul, http://www.barringer1.com/raptor.htm, RAPTOR software for no cost
downloads, 2004.
5. Barringer, H. Paul, Weibull Database, http://www.barringer1.com/wdbase.htm, 2004
6. Barringer, H. Paul, Life Cycle Cost, author and publisher, Barringer & Associates, Inc., P.O.
Box 3985, Humble, TX, 2004
7. Barringer, H. Paul, Download Papers, http://www.barringer1.com/Papers.htm, 2004
8. CCPS Staff, Guidelines for Improving Plant Reliability through Data Collection and
Analysis, Center For Chemical Process Safety of the American Institute Of Chemical
Engineers, AIChE, New York, ISBN 0-8169-0751-X, 1998
9. Fabrycky, Wolter J. and Benjamin S. Blanchard, Life-Cycle Cost and Economic Analysis,
Prentice Hall, Englewood Cliffs, New Jersey, ISBN 0-13-538323-4, 1991
10. Fulton, Wes, WinSMITH Weibull probability plotting software,
http://www.weibullnews.com, 2004
11. Modarres, Mohammad, Mark Kaminskiy, Vasiliy Krivtsov, Reliability Engineering and
Risk Analysis, Marcel Dekker, New York, ISBN 0-8247-2000-8, 1999
12. SAE M-110.2, Reliability and Maintainability Guideline for Manufacturing Machinery
and Equipment, Second Edition, Society of Automotive Engineers, Warrendale, PA,
ISBN 0-7680-0473-X, 1999
13. Taylor, J.R., Risk Analysis for Process Plant, Pipelines and Transport, E & FN Spon,
New York, ISBN 0-419-19090-2, 1994
Biography
Paul Barringer, P.E. is a manufacturing, engineering, and reliability consultant with more than forty
years of engineering and manufacturing experience in design, production, quality, maintenance, and
reliability of technical products. Experienced in both the technical and bottom-line aspects of operating
a business with management experience in manufacturing and engineering for an ISO 9001 facility.
Industrial experience includes the oil and gas services business for high pressure and deep holes, super
alloy manufacturing, and isotope separation using ultra high speed rotating devices.
He is author of training courses: Reliability Engineering Principles for calculating the life of
equipment and predicting the failure free interval, Process Reliability for finding the reliability
of processes and quantifying production losses, and Life Cycle Cost for finding the most cost
effective alternative from many equipment scenarios using reliability concepts.
Barringer is a Registered Professional Engineer, Texas. Inventor named in six U.S.A. Patents and
numerous foreign patents. He is a contributor to The New Weibull Handbook, a reliability
handbook, published by Dr. Robert B. Abernethy.
His education includes a MS and BS in Mechanical Engineering from North Carolina State
University. He participated in Harvard University's three-week Manufacturing Strategy
conference.
Other reliability and life cycle costs details are available at http://www.barringer1.com