Fault Tolerance Mechanisms in Distributed Systems
Fault Tolerance Mechanisms in Distributed Systems
net/publication/287198069
CITATIONS READS
28 10,579
2 authors:
Some of the authors of this publication are also working on these related projects:
Call for Chapters: Applying Methods of Scientific Inquiry into Intelligence, Security, and Counterterrorism View project
Internet of Things and Smart City Initiatives in Middle Eastern Countries View project
All content following this page was uploaded by Arif Sari on 17 December 2015.
Abstract
The use of technology has increased vastly and today computer systems are interconnected via
different communication medium. The use of distributed systems in our day to day activities has
solely improved with data distributions. This is because distributed systems enable nodes to or-
ganise and allow their resources to be used among the connected systems or devices that make
people to be integrated with geographically distributed computing facilities. The distributed sys-
tems may lead to lack of service availability due to multiple system failures on multiple failure
points. This article highlights the different fault tolerance mechanism in distributed systems used
to prevent multiple system failures on multiple failure points by considering replication, high re-
dundancy and high availability of the distributed services.
Keywords
Fault Tolerance, Distributed System, Replication, Redundancy, High Availability
1. Introduction
A faulty system creates a human/economic loss, air and rail traffic control, telecommunication loss, etc. The
need for a reliable fault tolerance mechanism reduces these risks to a minimum. In distributed systems, faults or
failures are limited or part. A part failure in distributed systems is not equally critical because the entire system
would not be offline or brought down, for example a system having more than one processing cores (CPU), and
if one of the cores fails the system would not stop functioning as though that’s the one physical core in the sys-
tem. Hence, the other cores would continue to function and process data normally. Nevertheless, in a non-dis-
tributed system when one of its components stops functioning, it would lead to a malfunction of the entire sys-
tem or program and the corresponding processes would stop.
Fault tolerance is the dynamic method that’s used to keep the interconnected systems together, sustain relia-
How to cite this paper: Sari, A. and Akkaya, M. (2015) Fault Tolerance Mechanisms in Distributed Systems. Int. J. Commu-
nications, Network and System Sciences, 8, 471-482. http://dx.doi.org/10.4236/ijcns.2015.812042
A. Sari, M. Akkaya
bility, and availability in distributed systems. The hardware and software redundancy methods are the known
techniques of fault tolerance in distributed system. The hardware methods ensure the addition of some hardware
components such as CPUs, communication links, memory, and I/O devices while in the software fault tolerance
method, specific programs are included to deal with faults. Efficient fault tolerance mechanism helps in detect-
ing of faults and if possible recovers from it.
There are various definitions to what fault tolerance is. In dealing with fault tolerance, replication is typically
used for general fault tolerance method to protect against system failure [1] [2]. Sebepou et al. highlighted three
major forms of replication mechanism which are [1] [2]:
The State Machine;
Process Pairs;
Roll Back Recovery.
1) State Machine
In this mechanism, the process state of a computer system is replicated on autonomous computer system at
the same time, all replica nodes process data in analogous or matching way and also there’s coordination in their
process among the replica nodes and all the inputs are sent to all replica at the same time [2] [3]. An active rep-
lica is an example of state machine [3] [4].
2) Process Pairs
The process pairs functions like a master (primary)/slave (secondary) link in replication coordination. The
primary workstation acts in place of a master to transmit its corresponding input to the secondary node. Both
nodes maintain a good communication link [3]-[5].
3) Roll Back Recovery (Check-Point-Based)
This mechanism collects check point momentarily and transfers these checkpoint states to a stable storage de-
vice or backup nodes. This enables a roll back recovery to be done successfully when or during recovery process.
The checkpoint is been reconstructed prior to the recent state [3]-[6].
2. Distributed System
Distributed system are systems that don’t share memory or clock, in distributed systems nodes connect and relay
information by exchanging the information over a communication medium. The different computer in distri-
buted system have their own memory and OS, local resources are owned by the node using the resources. While
the resources that is been accessed over the network or communication medium is known to be remote resources
[5]-[7]. Figure 1 shows the communication network between systems in the distributed environment.
In distributed system, pool of rules are executed to synchronise the actions of various or different processes
on a communication network, thereby forming a distinct set of related tasks [6]-[9]. The independent system or
computers access resources remotely or locally in the distributed system communication environment, these
Processor
Memory Processor
Memory
Processor
Memory
472
A. Sari, M. Akkaya
resources are put together to form a single intelligible system. The user in the distributed environment is not
aware of the multiple interconnected system that ensures the task is carried out accurately. In distributed system,
no single system is required or carries the load of the entire system in processing a task [8] [9].
Distributed applications
Middleware service
Network
Figure 2. A simple architecture of a distributed system.
473
A. Sari, M. Akkaya
In the partially connected network, some node have direct links while others don’t. Some models of partially
connected networks are star structured networks, multi-access bus net work, ring structured network, and tree
structured network. In Figures 4-7 illustrates the corresponding networks. The disadvantages in these network
474
A. Sari, M. Akkaya
in are: in the Star designed network, when the main node fails to function the entire networked system stops to
function they collapse. In multi-access bus network, nodes are connected to each other through a communication
link “a bus”. If the bus link connecting the nodes fails to function, all other nodes can’t connect to each other,
and the performance of the network drops as more nodes or computers are added to the system or heavy traffic
occurs in the system. In the ring network, nodes are connected at least to two other nodes in the network creating
a path for signals to be exchanged between the connected nodes. As new nodes are added to the network, the
transmission delay becomes longer. If a node fail every other node in the network can be inaccessible. In the tree
structured network, this is like a net work with hierarchy, each node in the network have a fixed number nodes
that is attached to it in the sub level of the tree. In this network messages that are transmitted from the parent to
the child nodes goes through one link.
For a distributed system to perform and function according to build, it must have the following characteristics;
Fault Tolerant, Scalability, Predictable Performance, Openness, Security, and Transparency.
475
A. Sari, M. Akkaya
The reduction is due to the decrease or slowdown in the operational hardware configuration or it may be some
design faults in the hardware.
476
A. Sari, M. Akkaya
Replication Protocol
replica1
replica2
Replica Manager
replica3
Consistency Management
477
A. Sari, M. Akkaya
p2
p2 p3
p1
p1
p2 p3
node
node
node node
dispatcher
recovery are the checkpoint and log based roll back recovery technique. Each of the type of rollback recovery
technique uses different mechanism; the checkpoint based uses the checkpoints states that it has stored in a sta-
ble storage device, while the log based rollback recovery techniques combines both check pointing and logging
of events [51].
In recovery form system failures, there are two type of check point technique that is used; coordinated and
uncoordinated checkpoint techniques, these techniques are related with message logging [34].
Coordinated Check Point: In this technique, check are coordinated to save a consistent state because the coor-
dinated checkpoint are consistent set of checkpoint. If the checkpoints are not consistent a full and complete
rollback of the system can’t be done [52]. In a situation where there’s frequent failure, coordinated check point
technique can’t be used. The recovery time can be set to a higher value or lower value, when set to a lower value,
it improves performance of the technique because it only select the recovery to last correct state of the system
instead from the very first state or checkpoint.
Uncoordinated Check Point: This technique combines the message logging to ensure that the rollback state is
correct. The uncoordinated check point technique executed checkpoints independently as well as recovery.
There are three type of message logging protocols: optimistic, pessimistic and casual. In the optimistic protocol
ensures all messages are logged. The pessimistic protocol makes sure that all the message that is received by a
process are logged appropriately and stored in a stable and reliable storage media before it is forwarded into the
system. While the causal protocol just log the message information of a process in all processes that are causally
dependent [53].
478
A. Sari, M. Akkaya
M1 M2
FUSE
Table 2. shows compares the different fault tolerance technique or mechanism in distributed system.
performance can be improved by focusing on or addressing the serious aspects involved. In all the techniques
involved, there is strong need for reliable, accurate and pure adaptive multiple failure detector mechanism [53],
[54].
7. Conclusion
Fault tolerance is a major part of distributed system, because it ensures the continuity and functionality of a sys-
tem at a point where there’s a fault or failure. This research showed the different type of fault tolerance tech-
nique in distributed system such as the Fusion Based Technique, Check Pointing and Roll Back Technique, and
Replication Based Fault Tolerance Technique. Each mechanism is advantageous over the other and costly in
deployment. In this paper we highlight the levels of fault tolerance such as the hardware fault tolerance which
ensures that additional backup hardware such as memory block, CPU, etc., software fault tolerance system
comprises of checkpoints storage and rollback recovery mechanisms, and the system fault tolerance is a com-
plete system that does both software and hardware fault tolerance, to ensure availability of the system during
failure, error or fault. Future research would be conducted on comparing the various data security mechanisms
and their performance metrics.
References
[1] Sebepou, Z. and Magoutis, K. (2011) CEC: Continuous Eventual Checkpointing for Data Stream Processing Operators.
Proceedings of IEEE/IFIP 41st International Conference on Dependable Systems and Networks, 145-156.
http://dx.doi.org/10.1109/dsn.2011.5958214
[2] Sari, A. and Necat, B. (2012) Impact of RTS Mechanism on TORA and AODV Protocol’s Performance in Mobile Ad
Hoc Networks. International Journal of Science and Advanced Technology, 2, 188-191.
[3] Chen, W.H. and Tsai, J.C. (2014) Fault-Tolerance Implementation in Typical Distributed Stream Processing Systems.
479
A. Sari, M. Akkaya
[4] Sari, A. and Necat, B. (2012) Securing Mobile Ad Hoc Networks against Jamming Attacks through Unified Security
Mechanism. International Journal of Ad Hoc, Sensor & Ubiquitous Computing, 3, 79-94.
http://dx.doi.org/10.5121/ijasuc.2012.3306
[5] Balazinska, M., Balakrishnan, H., Madden, S. and Stonebraker, M. (2008) Fault-Tolerance in the Borealis Distributed
Stream Processing System. ACM Transactions on Database Systems, 33, 1-44.
http://dx.doi.org/10.1145/1331904.1331907
[6] Sari, A. (2014) Security Approaches in IEEE 802.11 MANET—Performance Evaluation of USM and RAS. Interna-
tional Journal of Communications, Network, and System Sciences, 7, 365-372.
http://dx.doi.org/10.4236/ijcns.2014.79038
[7] Elnozahy, E.N.M., Alvisi, L., Wang, Y.M. and Johnson, D.B. (2002) A Survey of Rollback-Recovery Protocols in
Message-Passing Systems. ACM Computing Surveys, 34, 375-408. http://dx.doi.org/10.1145/568522.568525
[8] Sari, A. (2014) Security Issues in RFID Middleware Systems: A Case of Network Layer Attacks: Proposed EPC Im-
plementation for Network Layer Attacks. Transactions on Networks & Communications, 2, 1-6.
http://dx.doi.org/10.14738/tnc.25.431
[9] Andrew, S. (1995) Tanenbaum Distributed Operating Systems. Prentice Hall, Upper Saddle River.
[10] Sari, A. (2015) Lightweight Robust Forwarding Scheme for Multi-Hop Wireless Networks. International Journal of
Communications, Network and System Sciences, 8, 19-28. http://dx.doi.org/10.4236/ijcns.2015.83003
[11] Coulouris, G., Dollimore, J. and Kindberg, T. (2001) Distributed Systems: Concepts and Design, 4th Edition, Pearson
Education Ltd., New York.
[12] Carter, W.C. and Bouricius, W.G. (1971) A Survey of Fault-Tolerant Computer Architecture and Its Evaluation. Com-
puter, 4, 9-16.
[13] Short, R.A. (1968) The Attainment of Reliable Digital Systems through the Use of Redundancy—A Survey. IEEE
Computer Group News, 2, 2-17.
[14] Sari, A. (2015) Two-Tier Hierarchical Cluster Based Topology in Wireless Sensor Networks for Contention Based
Protocol Suite. International Journal of Communications, Network and System Sciences, 8, 29-42.
http://dx.doi.org/10.4236/ijcns.2015.83004
[15] Cooper, A.E. and Chow, W.T. (1976) Development of On-Board Space Computer Systems. IBM Journal of Research
and Development, 20, 5-19. http://dx.doi.org/10.1147/rd.201.0005
[16] Tanenbaum, A. and Van Steen, M. (2007) Distributed Systems: Principles and Paradigms. 2nd Edition, Pearson Pren-
tice Hall, Upper Saddle River.
[17] Koren, I. and Krishna, C.M. (2007) Fault-Tolerance Systems. Elsevier Inc., San Francisco.
[18] Sari, A. and Onursal, O. (2013) Role of Information Security in E-Business Operations. International Journal of In-
formation Technology and Business Management, 3, 90-93.
[19] Avizienis, A., Kopetz, H. and Laprie, J.C. (1987) Dependable Computing and Fault-Tolerant Systems, Volume 1: The
Evolution of Fault-Tolerant Computing. Springer-Verlag, Vienna, 193-213.
[20] Sari, A. and Çağlar, E. (2015) Performance Simulation of Gossip Relay Protocol in Multi-Hop Wireless Networks. So-
cial and Applied Sciences Journal, Girne American University, 7, 145-148.
[21] Harper, R., Lala, J. and Deyst, J. (1988) Fault-Tolerant Parallel Processor Architectural Overview. Proceedings of the
18st International Symposium on Fault-Tolerant Computing, Tokyo, 27-30 June 1988.
[22] Sari, A. and Rahnama, B. (2013) Addressing Security Challenges in WiMAX Environment. In: Proceedings of the 6th
International Conference on Security of Information and Networks, ACM Press, New York, 454-456.
http://dx.doi.org/10.1145/2523514.2523586
[23] Briere, D. and Traverse, P. (1993) AIRBUS A320/A330/A340 Electrical Flight Controls: A Family of Fault-Tolerant
Systems. Proceedings of the 23rd International Symposium on Fault-Tolerant Computing, Toulouse, 22-24 June 1993.
[24] Charron-Bost, B., Pedone, F. and Schiper, A. (2010) Replication: Theory and Practice. Lecture Notes in Computer
Science, 5959.
[25] Sari, A. (2015) Security Issues in Mobile Wireless Ad Hoc Networks: A Comparative Survey of Methods and Tech-
niques to Provide Security in Wireless Ad Hoc Networks. New Threats and Countermeasures in Digital Crime and
Cyber Terrorism, IGI Global, Hershey, 66-94.
[26] Sari, A. and Rahnama, B. (2013) Simulation of 802.11 Physical Layer Attacks in MANET. Proceedings of the Fifth
International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN), Madrid,
5-7 June 2013, 334-337. http://dx.doi.org/10.1109/cicsyn.2013.79
[27] Tanenbaum, A.S. and van Steen, M. (2002) Distributed Systems: Principles and Paradigms. Pearson Education Asia.
480
A. Sari, M. Akkaya
[28] Sari, A., Rahnama, B. and Caglar, E. (2014) Ultra-Fast Lithium Cell Charging for Mission Critical Applications.
Transactions on Machine Learning and Artificial Intelligence, 2, 11-18. http://dx.doi.org/10.14738/tmlai.25.430
[29] Ebnenasir, A. (2005) Software Fault-Tolerance. Computer Science and Engineering Department, Michigan State Uni-
versity, East Lansing. http://www.cse.msu.edu/~cse870/Lectures/SS2005 /ft1.pdf
[30] Birman, K. (2005) Reliable Distributed Systems: Technologies, Web Services and Applications. Springer-Verlag, Ber-
lin.
[31] Obasuyi, G. and Sari, A. (2015) Security Challenges of Virtualization Hypervisors in Virtualized Hardware Environ-
ment. International Journal of Communications, Network and System Sciences, 8, 260-273.
http://dx.doi.org/10.4236/ijcns.2015.87026
[32] Avizienis, A. (1975) Architecture of Fault-Tolerant Computing Systems. Proceedings of the 1975 International Sym-
posium on Fault-Tolerant Computing, Paris, 18-20 June 1975, 3-16.
[33] Sari, A. (2015) A Review of Anomaly Detection Systems in Cloud Networks and Survey of Cloud Security Measures
in Cloud Storage Applications. Journal of Information Security, 6, 142-154. http://dx.doi.org/10.4236/jis.2015.62015
[34] Short, R.A. (1968) The Attainment of Reliable Digital Systems through the Use of Redundancy—A Survey. IEEE
Computer Group News, 2, 2-17.
[35] Sari, A. (2014) Influence of ICT Applications on Learning Process in Higher Education. Procedia—Social and Beha-
vioral Sciences, 116, 4939-4945. http://dx.doi.org/10.1016/j.sbspro.2014.01.1053
[36] Huang, M. and Bode, B. (2005) A Performance Comparison of Tree and Ring Topologies in Distributed Systems.
Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, Denver, 4-8 April 2005,
258.1. http://dx.doi.org/10.1109/IPDPS.2005.57
[37] Huang, M. (2004) A Performance Comparison of Tree and Ring Topologies in Distributed System. Master’s Thesis.
www.osti.gov
[38] Minar, N. (2001) Distributed Systems Topologies: Part 1. http://openp2p.com
[39] Wiesmann, M., Pedone, F., Schiper, A., Kemme, B. and Alonso, G. (2000) Understanding Replication in Databases
and Distributed Systems. Research Supported by EPFL-ETHZ DRAGON Project and OFES.
[40] Herlihy, M. and Wing, J. (1990) Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions
on Programming Languages and Systems, 12, 463-492. http://dx.doi.org/10.1145/78969.78972
[41] Ahamad, M., Hutto, P.W., Neiger, G., Burns, J.E. and Kohli, P. (1994) Causal Memory: Definitions, Implementations
and Programming. TR GIT-CC-93/55, Georgia Institute of Technology, Atlanta.
[42] Rahnama, B., Sari, A. and Makvandi, R. (2013) Countering PCIe Gen. 3 Data Transfer Rate Imperfection Using Serial
Data Interconnect. Proceedings of the 2013 International Conference on Technological Advances in Electrical, Elec-
tronics and Computer Engineering (TAEECE), Konya, 9-11 May 2013, 579-582.
http://dx.doi.org/10.1109/TAEECE.2013.6557339
[43] Budhiraja, N., Marzullo, K., Schneider, F.B. and Toueg, S. (1993) The Primary-Backup Approach. In: Mullender, S.,
Ed., Distributed Systems, ACM Press, New York, 199-216.
[44] Gifford, D.K. (1979) Weighted Voting for Replicated Data. Proceedings of the Seventh ACM Symposium on Operating
Systems Principles, Pacific Grove, 10-12 December 1979, 150-162. http://dx.doi.org/10.1145/800215.806583
[45] Osrael, J., Froihofer, L., Goeschka, K.M., Beyer, S., Galdamez, P. and Munoz, F. (2006) A System Architecture for
Enhanced Availability of Tightly Coupled Distributed Systems. Proceedings of the First International Conference on
Availability, Reliability and Security, Vienna, 20-22 April 2006.
[46] Cao, H.H. and Zhu, J.M. (2008) An Adaptive Replicas Creation Algorithm with Fault Tolerance in the Distributed
Storage Network. Proceedings of the Second International Symposium on Intelligent Information Technology Applica-
tion, Shanghai, 20-22 December 2008, 738-741.
[47] Shye, A., Blomstedt, J., Moseley, T., Reddi, V. and Connors, D.A. (2008) PLR: A Software Approach to Transient
Fault Tolerance for Multicore Architectures. IEEE Transactions on Dependable and Secure Computing, 6, 135-148.
http://dx.doi.org/10.1109/TDSC.2008.62
[48] Agarwal, V. (2004) Fault Tolerance in Distributed Systems. Indian Institute of Technology, Kanpur.
www.cse.iitk.ac.in/report-repository
[49] Jung, H., Shin, D., Kim, H. and Lee, H.Y. (2005) Design and Implementation of Multiple Fault Tolerant MPI over
Myrinet (M3). In: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, IEEE Computer Society,
Washington DC, 32. http://dx.doi.org/10.1109/SC.2005.22
[50] Elnozahy, M., Alvisi, L., Wang, Y.M. and Johnson, D.B. (1996) A Survey of Rollback-Recovery Protocols in Message
Passing Systems. Technical Report CMU-CS-96-81, School of Computer Science, Carnegie Mellon University, Pitts-
burgh.
481
A. Sari, M. Akkaya
[51] Alvisi, L. and Marzullo, K. (1995) Message Logging: Pessimistic, Optimistic, and Causal. Proceedings of the 15th In-
ternational Conference on Distributed Computing, Systems (ICDCS 1995), Vancouver, 30 May-2 Jun 1995, 229-236.
http://dx.doi.org/10.1109/ICDCS.1995.500024
[52] Garg, V.K. (2010) Implementing Fault-Tolerant Services Using Fused State Machines. Technical Report ECE-PDS-
2010-001, Parallel and Distributed Systems Laboratory, ECE Department, University of Texas, Austin.
[53] Xiong, N., Cao, M., He, J. and Shu, L. (2009) A Survey on Fault Tolerance in Distributed Network Systems. Proceed-
ings of the 2009 International Conference on Computational Science and Engineering, Vancouver, 29-31 August 2009,
1065-1070. http://dx.doi.org/10.1109/CSE.2009.497
[54] Tian, D., Wu, K. and Li, X. (2008) A Novel Adaptive Failure Detector for Distributed Systems. Proceedings of the
2008 International Conference on Networking, Architecture, and Storage, Chongqing, 12-14 June 2008, 215-221.
http://dx.doi.org/10.1109/NAS.2008.37
482