Teach Computer Science
Reliability in computer systems
teachcomputerscience.com
Lesson Objectives
▪ Students will learn about things that can go wrong in a
computer system and how to avoid such situations.
▪ What are critical systems?
▪ How to protect systems from failure
▪ What to do if a computer system fails
▪ How to analyse reliability of a system
1. Content
What is reliability?
▪ Reliability of any computer-related component is an attribute
that denotes its consistent performance according to the
specifications.
Things that can go wrong
Hardware might fail to operate.
Software might contain bugs or errors.
Human errors can also make the system inefficient. Security is very important to systems as there might even be a deliberate attack.
Power cuts and natural disasters like flooding or an earthquake affect the operation of systems too.
[Diagram: causes of system failure: natural disasters, hardware failure, software error, human error]
What is a critical system?
▪ Critical systems are computer systems that must be highly
reliable, as their failure may have a great impact on human lives.
▪ They are developed using conservative, proven techniques rather than new ones.
▪ A new technique is adopted only after its long-term effects have been analysed, even if it appears to be more efficient.
Types of critical systems
▪ Safety-critical systems: Designed to avoid danger to human lives and the environment. Example: Temperature control of nuclear reactors.
▪ Mission-critical systems: Failure of these systems affects the overall performance, as they are responsible for the goals in the system. Example: navigation systems of aircraft.
▪ Business-critical systems: Designed to avoid loss of business, economic loss and loss of reputation. Example: Banking systems.
▪ Security-critical systems: Designed to protect sensitive information that can be misused when in the wrong hands. Example: Defence systems.
What is backup?
▪ Duplicate data and files stored in a separate server or storage
drive to improve the reliability of a system are called backups.
▪ This protects the data from being lost due to failure.
▪ Backup is also useful when data is accidentally overwritten.
Backup procedure
The team responsible for the backup procedure performs the backup according to a well-defined schedule.
Backup disks are to be stored in a secure location.
Disks and tapes kept in a secure, fireproof location away from the main site are called an off-site backup.
The data can also be backed up over the Internet using cloud technology.
[Diagram: back-up: safe, scheduled, fire-safe, cloud technology]
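As a rough illustration of a single backup step, the sketch below copies a file into a separate folder under a timestamped name. The function name `back_up_file`, the folder name `backups` and the file `accounts.txt` are made up for this example; a real procedure would also write to a separate drive or server and run on a schedule.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def back_up_file(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` under a timestamped name and return the copy's path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{source.stem}-{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves file metadata where possible
    return target


if __name__ == "__main__":
    # Demonstration in a temporary directory, so no real data is touched.
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "accounts.txt"  # hypothetical data file
        original.write_text("balance=100\n")
        copy = back_up_file(original, Path(tmp) / "backups")
        original.write_text("")  # simulate an accidental overwrite
        print(copy.read_text())  # the backup still holds the original data
```

Because each copy carries its own timestamp, an older version is still available if the data is overwritten after a later backup.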
Disaster recovery
Disaster recovery is the process of getting back lost data from the backup after a
system failure.
Let us consider the example of hardware failure. To recover from this failure, the
hardware is repaired or replaced with new hardware. The data is recovered from
the backup and copied to the hardware.
Examples of precautionary measures taken by an organisation to avoid disaster
are the use of an uninterruptible power supply (UPS), surge protectors (to limit the
effect of power surges on electronic equipment), fire prevention and anti-virus software.
Redundancy
Redundancy is the duplication of critical parts of a computer system to improve reliability.
If the primary system fails, the backup or reserve system steps in.
Redundancy is very important in critical systems like aircraft systems. If any hardware or software fails during a flight, the redundant system steps in to avoid failure.
[Diagram: redundancy: hardware redundancy, software redundancy, data redundancy]
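The idea of a reserve system stepping in can be sketched in a few lines. This is only a simplified illustration: `call_with_failover`, `primary_sensor` and `reserve_sensor` are hypothetical names, and the failure is simulated by raising an exception.

```python
from typing import Callable, Sequence


def call_with_failover(components: Sequence[Callable[[], str]]) -> str:
    """Try each redundant component in order and return the first successful result."""
    errors = []
    for component in components:
        try:
            return component()
        except Exception as exc:  # a failed component hands over to the next one
            errors.append(exc)
    raise RuntimeError(f"all {len(components)} redundant components failed: {errors}")


def primary_sensor() -> str:
    raise ConnectionError("primary sensor not responding")  # simulated hardware failure


def reserve_sensor() -> str:
    return "altitude=10000"


if __name__ == "__main__":
    # The primary fails, so the reserve component supplies the reading.
    print(call_with_failover([primary_sensor, reserve_sensor]))
```

The caller receives a reading without needing to know which component produced it. An error is raised only when every redundant component has failed.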
Types of redundancy
▪ Hardware redundancy: Computer systems have an extra critical hardware device to avoid failure. Example: A system is provided with two power supplies in a parallel set up so that they can be easily switched if one of them fails. Redundant array of independent disks (RAID): multiple physical disk drives are used to store redundant data.
▪ Software redundancy: Redundant software is used to replace the original program in case it fails.
▪ Data redundancy: Redundant data in the backup can replace the original data in case the original data is lost or overwritten accidentally.
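One way in which redundant data on several disks can restore a lost disk is parity, which some RAID levels use. The sketch below is a toy illustration only. It treats short byte strings as "disks", and the helper name `xor_blocks` is invented for this example. Real RAID controllers work on whole drives and handle many more details.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    disk_a = b"ACCT"
    disk_b = b"1234"
    parity = xor_blocks([disk_a, disk_b])  # redundant data kept on a third disk

    # Simulate losing disk_b: XOR of the surviving disk and the parity restores it.
    recovered_b = xor_blocks([disk_a, parity])
    print(recovered_b == disk_b)
```

Mirroring, where every disk simply holds a full copy of the data, is a simpler form of the same idea but needs more storage.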
What is fault-tolerance?
▪ Fault tolerance is a property that enables a system to operate
properly even if the system undergoes one or more failures.
▪ Essential for life-critical systems.
▪ This design enables a system to continue its operation, possibly
at a reduced level, rather than failing completely, even when
some parts of the system fail.
▪ Data is protected from damage, intrusion or disclosure.
What is a fail-soft system?
▪ A system that fails gracefully, that is, continues to operate at a
reduced level after some components fail, is called a fail-soft system.
▪ For example: a building may operate with reduced lighting and
elevators in case the power fails.
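In software, fail-soft behaviour often means falling back to a reduced service instead of stopping. The sketch below assumes a hypothetical exchange-rate feed: if the live source fails, the program carries on with the last known value and flags that it is running in a degraded mode. All names here are illustrative.

```python
def get_exchange_rate(fetch_live_rate, last_known_rate: float) -> tuple[float, bool]:
    """Return (rate, degraded). Falls back to the last known rate if the live source fails."""
    try:
        return fetch_live_rate(), False
    except Exception:
        # Reduced level of service: older data instead of no data at all.
        return last_known_rate, True


def broken_feed() -> float:
    raise TimeoutError("rate service unavailable")  # simulated component failure


if __name__ == "__main__":
    rate, degraded = get_exchange_rate(broken_feed, last_known_rate=1.25)
    print(rate, "(degraded mode)" if degraded else "(live)")
```

Returning the `degraded` flag alongside the value lets the rest of the program, or the user, know that the answer may be out of date.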
Defensive programming
Software can be made more reliable by adding extra checks.
These checks warn the user when the program is not working in the
desired manner. This is called defensive programming.
This enables the user to take action.
Without these extra checks, the program might crash without any
warning.
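A small example of such extra checks is shown below, using a hypothetical `withdraw` function for a bank account. It is a minimal sketch: the rules it enforces (a positive amount, sufficient funds) are assumptions chosen for illustration.

```python
def withdraw(balance: float, amount: float) -> float:
    """Return the new balance after a withdrawal, checking the inputs first."""
    if not isinstance(amount, (int, float)) or isinstance(amount, bool):
        raise TypeError("amount must be a number")
    if amount <= 0:
        raise ValueError("amount must be greater than zero")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


if __name__ == "__main__":
    for requested in (30, -5, 500):
        try:
            print("new balance:", withdraw(100, requested))
        except ValueError as problem:
            # The check turns a silent error into a clear warning for the user.
            print("warning:", problem)
```

Without the checks, a negative amount would silently increase the balance. With them, the problem is reported at the point where it occurs.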
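Measuring reliability
Reliability of a system is measured using various statistical parameters that are used to predict how reliable the system is.
[Diagram: timeline running from a system failure, through the point where the system resumes normal operation, to the next system failure. Time to repair runs from the failure until normal operation resumes; time to failure runs from that point until the next failure; time between failures spans both.]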
Reliability factors
Percentage of time:
The percentage of a given period, such as a particular month, for which the service was
available and operational.
Number of hours:
Number of hours denotes the amount of time the system has operated without
reporting any problems.
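The percentage of time can be worked out from the length of the period and the time the service was unavailable. The sketch below assumes a 30-day month and an invented downtime figure. The function name `availability_percent` is also just for illustration.

```python
def availability_percent(total_hours: float, downtime_hours: float) -> float:
    """Percentage of the period for which the service was available."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime_hours must be between 0 and total_hours")
    return 100 * (total_hours - downtime_hours) / total_hours


if __name__ == "__main__":
    hours_in_month = 30 * 24  # 720 hours in a 30-day month (assumed)
    print(round(availability_percent(hours_in_month, downtime_hours=3.6), 2))
```

With these example figures, 3.6 hours of downtime in 720 hours corresponds to an availability of 99.5%.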
Reliability factors
Downtime:
The period during which a system is broken down or out of action. Zero
downtime refers to a system that is available all the time.
Mean time between failures (MTBF):
Mean time between failures is calculated by taking the average of the time
between failures of a system.
Mean time to failure (MTTF):
Mean time to failure is the average length of time for which the system is expected to
continue its operation before system failure.
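These averages can be calculated from a log of outages. The sketch below follows the timeline from the "Measuring reliability" slide, where the time between failures is the time to repair plus the time to failure. The outage log is invented for illustration, and the label MTTR (mean time to repair) is added here for convenience. Note that other sources define MTBF slightly differently.

```python
from statistics import mean

# Hypothetical log: (hour the system failed, hour it resumed normal operation).
outages = [(100, 104), (300, 302), (620, 626)]


def reliability_metrics(outages: list[tuple[float, float]]) -> dict[str, float]:
    """Compute mean time to repair, to failure, and between failures from an outage log."""
    if len(outages) < 2:
        raise ValueError("need at least two outages to measure time between failures")
    repair = [resumed - failed for failed, resumed in outages]
    # Time to failure: from resuming normal operation until the next failure.
    to_failure = [nxt[0] - cur[1] for cur, nxt in zip(outages, outages[1:])]
    # Time between failures: from one failure to the next failure.
    between = [nxt[0] - cur[0] for cur, nxt in zip(outages, outages[1:])]
    return {"MTTR": mean(repair), "MTTF": mean(to_failure), "MTBF": mean(between)}


if __name__ == "__main__":
    print(reliability_metrics(outages))
```

For this log the repairs take 4, 2 and 6 hours (mean 4), the system runs for 196 and 318 hours before failing again (mean 257), and the failures are 200 and 320 hours apart (mean 260).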
Let’s review some concepts
▪ Reliability: Reliability of any computer-related component is an attribute that denotes its consistent performance according to the specifications.
▪ Critical systems: Critical systems are computer systems that must be highly reliable as their failure may have a great impact on human lives.
▪ Backup: Duplicate data and files stored in a separate server or storage drive to improve the reliability of a system are called backups.
▪ Redundancy: Redundancy is the duplication of critical parts of a computer system to improve reliability. (Hardware, software and data)
▪ Fault-tolerance: Fault tolerance is a property that enables a system to operate properly even if the system undergoes one or more failures.
▪ Statistical parameters to measure reliability: Percentage of time, Number of hours, Downtime, Mean time between failures and Mean time to failure.
2. Activity
Activity-1
Duration: 15 minutes
You are a programmer developing a banking system.
A. What are the important parts of this system? In what ways
could these parts fail?
B. How can you protect the system from possible failures?
3. End of topic questions
End of topic questions
1. Where are backups stored?
2. What are the different types of redundancy? How are they
useful in improving the reliability of systems?
3. What is a fault-tolerant system?
4. What is a fail-soft system?
5. How can the reliability of a system be measured? Write down
the different parameters with a line of explanation.