KEMBAR78
Distributed File Systems Concepts and e 61384 | PDF | Computer File | File System
0% found this document useful (0 votes)
155 views54 pages

Distributed File Systems Concepts and e 61384

This paper establishes a viewpoint that emphasizes the dispersed structure and decentralization of both data and control in the design of such systems. It defines the concepts of transparency, fault tolerance, and scalability. A survey of contemporary UNIX@-based systems illustrates the concepts.

Uploaded by

soniad1981
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
155 views54 pages

Distributed File Systems Concepts and e 61384

This paper establishes a viewpoint that emphasizes the dispersed structure and decentralization of both data and control in the design of such systems. It defines the concepts of transparency, fault tolerance, and scalability. A survey of contemporary UNIX@-based systems illustrates the concepts.

Uploaded by

soniad1981
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 54

Distributed File Systems: Concepts and Examples

ELIEZER LEVY and ABRAHAM SILBERSCHATZ


Department of Computer Sciences, University of Texas at Austin, Austin, Texas 78712-l 188

The purpose of a distributed file system (DFS) is to allow users of physically distributed
computers to share data and storage resources by using a common file system. A typical
configuration for a DFS is a collection of workstations and mainframes connected by a
local area network (LAN). A DFS is implemented as part of the operating system of each
of the connected computers. This paper establishes a viewpoint that emphasizes the
dispersed structure and decentralization of both data and control in the design of such
systems. It defines the concepts of transparency, fault tolerance, and scalability and
discusses them in the context of DFSs. The paper claims that the principle of distributed
operation is fundamental for a fault tolerant and scalable DFS design. It also presents
alternatives for the semantics of sharing and methods for providing access to remote files.
A survey of contemporary UNIX@-based systems, namely, UNIX United, Locus, Sprite,
Sun’s Network File System, and ITC’s Andrew, illustrates the concepts and demonstrates
various implementations and design alternatives. Based on the assessment of these
systems, the paper makes the point that a departure from the approach of extending
centralized file systems over a communication network is necessary to accomplish sound
distributed file system design.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]:


Distributed Systems-distributed applications; network operating systems; D.4.2
[Operating Systems]: Storage Management-allocation/deallocation strategies; storage
hierarchies; D.4.3 [Operating Systems]: File Systems Management-directory
structures; distributed file systems; file organization; maintenance; D.4.4 [Operating
Systems]: Communication Management-buffering; network communication; D.4.5
[Operating Systems]: Reliability-fault tolerance; D.4.7 [Operating Systems]:
Organization and Design-distributed systems; F.5 [Files]: Organization/structure
General Terms: Design, Reliability
Additional Key Words and Phrases: Caching, client-server communication, network
transparency, scalability, UNIX

INTRODUCTION discusses Distributed File Systems (DFSs)


The need to share resources in a commuter as
. the means of sharing storage space and
system arises due to economics or the na- data.
ture of some applications. In such cases, it A file system is a subsystem of an oper-
is necessary to facilitate sharing long-term ating system whose purpose is to provide
long-term storage. It does so by implement-
storage devices and their data. This paper
ing files-named objects that exist from
th& explicit creatidn until their explicit
@UNIX is a trademark of AT&T Bell Laboratories. destruction and are immune to temporary

Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its
date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To
copy otherwise, or to republish, requires a fee and/or specific permission.
0 1990 ACM 0360-0300/90/1200-0321 $01.50

ACM Computing Surveys, Vol. 22, No. 4, December 1990


322 l E. Levy and A. Silberschatz
CONTENTS failures in the system. A DFS is a distrib-
uted implementation of the classical time-
sharing model of a file system, where mul-
tiple users share files and storage resources.
INTRODUCTION The UNIX time-sharing file system is usu-
1. TRENDS AND TERMINOLOGY ally regarded as the model [Ritchie and
2. NAMING AND TRANSPARENCY Thompson 19741. The purpose of a DFS is
2.1 Location Transparency and Independence
2.2 Naming Schemes
to support the same kind of sharing when
2.3 Implementation Techniques users are physically dispersed in a distrib-
3. SEMANTICS OF SHARING uted system. A distributed system is a col-
3.1 UNIX Semantics lection of loosely coupled machines-either
3.2 Session Semantics
3.3 Immutable Shared Files Semantics
a mainframe or a workstation-intercon-
3.4 Transaction-like Semantics nected by a communication network. Un-
4. REMOTE-ACCESS METHODS less specified otherwise, the network is a
4.1 Designing a Caching Scheme local area network (LAN). From the point
4.2 Cache Consistency of view of a specific machine in a distrib-
4.3 Comparison of Caching and Remote Service
5. FAULT TOLERANCE ISSUES
uted system, the rest of the machines and
5.1 Stateful Versus Stateless Service their respective resources are remote and
5.2 Improving Availability the machine’s own resources are local.
5.3 File Replication To explain the structure of a DFS, we
6. SCALABILITY ISSUES
need to define service, server, and client
6.1 Guidelines by Negative Examples
6.2 Lightweight Processes [Mitchell 19821. A service is a software
7. UNIX UNITED entity running on one or more machines
7.1 Overview and providing a particular type of function
7.2 Implementation-Newcastle Connection to a priori unknown clients. A server is the
7.3 Summary
8. LOCUS
service software running on a single ma-
8.1 Overview chine. A client is a process that can invoke
8.2 Name Structure a service using a set of operations that form
8.3 File Operations its client interface (see below). Sometimes,
8.4 Synchronizing Accesses to Files
a lower level interface is defined for the
8.5 Operation in a Faulty Environment
8.6 Summary actual cross-machine interaction. When
9. SUN NETWORK FILE SYSTEM the need arises, we refer to this interface as
9.1 Overview the intermachine interface. Clients imple-
9.2 NFS Services
ment interfaces suitable for higher level
9.3 Implementation
9.4 Summary applications or direct access by humans.
10. SPRITE Using the above terminology, we say a
10.1 Overview file system provides file services to clients.
10.2 Looking Up Files with Prefix Tables A client interface for a file service is formed
10.3 Caching and Consistency
10.4 summary
by a set of file operations. The most primi-
11. ANDREW tive operations are Create a file, Delete a
11.1 Overview file, Read from a file, and Write to a file.
11.2 Shared Name Space The primary hardware component a file
11.3 File Operations and Sharing Semantics
server controls is a set of secondary storage
11.4 Implementation
11.5 Summary devices (i.e., magnetic disks) on which files
12. OVERVIEW OF RELATED WORK are stored and from which they are re-
13. CONCLUSIONS trieved according to the client’s requests.
ACKNOWLEDGMENTS We often say that a server, or a machine,
REFERENCES
BIBLIOGRAPHY
stores a file, meaning the file resides on one
of its attached devices. We refer to the file
system offered by a uniprocessor, time-
sharing operating system (e.g., UNIX 4.2
BSD) as a conventional file system.

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 323

A DFS is a file system, whose clients, the survey paper by Tanenbaum and Van
servers, and storage devices are dispersed Renesse [ 19851, where the broader context
among the machines of a distributed sys- of distributed operating systems and com-
tem. Accordingly, service activity has to be munication primitives are discussed.
carried out across the network, and instead In light of the profusion of UNIX-based
of a single centralized data repository there DFSs and the dominance of the UNIX file
are multiple and independent storage de- system model, five UNIX-based systems
vices. As will become evident, the concrete are surveyed. The first part of the paper is
configuration and implementation of a independent of this choice as much as pas-
DFS may vary. There are configurations sible. Since a vast majorit,y of the actual
where servers run on dedicated machines, DFSs (and all systems surveyed and men-
as well as configurations where a machine tioned in this paper) have some relation to
can be both a server and a client. A DFS UNIX, however, it is inevitable that the
can be implemented as part of a distributed concepts are understood best in the UNIX
operating system or, alternatively, by a context. The choice of the five systems
software layer whose task is to manage and the order of their presentation demon-
the communication between conventional strate the evolution of DFSs in the last
operating systems and file systems. The decade.
distinctive features of a DFS are the Section 1 presents the terminology and
multiplicity and autonomy of clients and concepts of transparency, fault tolerance,
servers in the system. and scalability. Section 2 discusses trans-
The paper is divided into two parts. In parency and how it is expressed in naming
the first part, which includes Sections 1 to schemes in greater detail. Section 3 intro-
6, the basic concepts underlying the design duces notions that are important for the
of a DFS are discussed. In particular, alter- semantics of sharing files, and Section 4
natives and trade-offs regarding the design compares methods of caching and remote
of a DFS are pointed out. The second part service. Sections 5 and 6 discuss issues
surveys five DFSs: UNIX United [Brown- related to fault tolerance and scalability,
bridge et al. 1982; Randell 19831, Locus respectively, pointing out observations
[Popek and Walker 1985; Walker et al. based on the designs of the surveyed sys-
19831, Sun’s Network File System (NFS) tems. Sections 7-11 describe each of the
[Sandberg et al. 1985; Sun Microsystems five systems mentioned above, including
Inc. 19881, Sprite [Nelson et al., 1988; distinctive features of a system not related
Ousterhout et al. 19881, and Andrew to the issues presented in the first part.
[Howard et al. 1988; Morris et al. 1986; Each description is followed by a summary
Satyanarayanan et al. 19851. These systems of the prominent features of the corre-
exemplify the concepts and observations sponding system. A table compares the five
mentioned in the first part and demon- systems and concludes the survey. Many
strate various implementations. A point in important aspects of DFSs and systems are
the first part is often illustrated by referring omitted from this paper; thus, Section 12
to a later section covering one of the sur- reviews related work not emphasized in our
veyed systems. discussion. Finally, Section 13 provides
The fundamental concepts of a DFS can conclusions and a bibliography provides re-
be studied without paying significant atten- lated literature not directly referenced.
tion to the actual operating system of which
it is a component. The first part of the
1. TRENDS AND TERMINOLOGY
paper adopts this approach. The second
part reviews actual DFS architectures that Ideally, a DFS should look to its clients like
serve to demonstrate approaches to inte- a conventional, centralized file system.
gration of a DFS with an operating system That is, the multiplicity and dispersion of
and a communication network. To comple- servers and storage devices should be trans-
ment our discussion, we refer the reader to parent to clients. As will become evident,

ACM Computing Surveys, Vol. 22, No. 4, December 1990


324 l E. Levy and A. Silberschatz

transparency has many dimensions and de- Systems have bounded resources and can
grees. A fundamental property, called net- become completely saturated under in-
work transparency, implies that clients creased load. Regarding a file system, sat-
should be able to access remote files using uration occurs, for example, when a server’s
the same set of file operations applicable to CPU runs at very high utilization rate or
local files. That is, the client interface of a when disks are almost full. As for a DFS in
DFS should not distinguish between local particular, server saturation is even a bigger
and remote files. It is up to the DFS to threat because of the communication over-
locate the files and arrange for the trans- head associated with processing remote
port of the data. requests. Scalability is a relative property;
Another aspect of transparency is user a scalable system should react more grace-
mobility, which implies that users can log fully to increased load than a nonscalable
in to any machine in the system; that is, one will. First, its performance should
they are not forced to use a specific ma- degrade more moderately than that of a
chine. A transparent DFS facilitates user nonscalable system. Second, its resources
mobility by bringing the user’s environ- should reach a saturated state later, when
ment (e.g., home directory) to wherever he compared with a nonscalable system.
or she logs in. Even a perfect design cannot accommo-
The most important performance mea- date an ever-growing load. Adding new re-
surement of a DFS is the amount of time sources might solve the problem, but it
needed to satisfy service requests. In con- might generate additional indirect load on
ventional systems, this time consists of disk other resources (e.g., adding machines to a
access time and a small amount of CPU distributed system can clog the network
processing time. In a DFS, a remote access and increase service loads). Even worse,
has the additional overhead attributed to expanding the system can incur expensive
the distributed structure. This overhead design modifications. A scalable system
includes the time needed to deliver the re- should have the potential to grow without
quest to a server, as well as the time needed the above problems. In a distributed sys-
to get the response across the network tem, the ability to scale up gracefully is of
back to the client. For each direction, in special importance, since expanding the
addition to the actual transfer of the infor- network by adding new machines or inter-
mation, there is the CPU overhead of run- connecting two networks together is com-
ning the communication protocol software. monplace. In short, a scalable design should
The performance of a DFS can be viewed withstand high-service load, accommodate
as another dimension of its transparency; growth of the user community, and enable
that is, the performance of a DFS should simple integration of added resources.
be comparable to that of a conventional file Fault tolerance and scalability are mu-
system. tually related to each other. A heavily
We use the term fault tolerance in a loaded component can become paralyzed
broad sense. Communication faults, ma- and behave like a faulty component. Also,
chine failures (of type fail stop), storage shifting a load from a faulty component to
device crashes, and decays of storage media its backup can saturate the latter. Gener-
are all considered to be faults that should ally, having spare resources is essential for
be tolerated to some extent. A fault- reliability, as well as for handling peak
tolerant system should continue function- loads gracefully.
ing, perhaps in a degraded form, in the face An advantage of distributed systems over
of these failures. The degradation can be in centralized systems is the potential for fault
performance, functionality, or both but tolerance and scalability because of the
should be proportional, in some sense, to multiplicity of resources. Inappropriate de-
the failures causing it. A system that grinds sign can, however, obscure this potential
to a halt when a small number of its com- and, worse, hinder the system’s scalability
ponents fail is not fault tolerant. and make it failure prone. Fault tolerance
The capability of a system to adapt to and scalability considerations call for a de-
increased service load is called scalability. sign demonstrating distribution of control
ACM Computing Surveys, Vol. 22, No. 4, December 1990
Distributed File Systems l 325

and data. Any centralized entity, be it a numerical identifier, which in turn is


central controller or a central data reposi- mapped to disk blocks. This multilevel
tory, introduces both a severe point of mapping provides users with an abstraction
failure and a performance bottleneck. of a file that hides the details of how and
Therefore, a scalable and fault-tolerant where the file is actually stored on the disk.
DFS should have multiple and independent In a transparent DFS, a new dimension
servers controlling multiple and indepen- is added to the abstraction, that of hiding
dent storage devices. where in the network the file is located. In
The fact that a DFS manages a set of a conventional file system the range of the
dispersed storage devices is its key distin- name mapping is an address within a disk;
guishing feature. The overall storage space in a DFS it is augmented to include the
managed by a DFS consists of different and specific machine on whose disk the file is
remotely located smaller storage spaces. stored. Going further with the concept of
Usually there is correspondence between treating files as abstractions leads to the
these constituent storage spaces and sets of notion of file replication. Given a file name,
files. We use the term component unit to the mapping returns a set of the locations
denote the smallest set of files that can be of this file’s replicas [Ellis and Floyd
stored on a single machine, independently 19831. In this abstraction, both the exist-
from other units. All files belonging to the ence of multiple copies and their locations
same component unit must reside in the are hidden.
same location. We illustrate the definition In this section, we elaborate on transpar-
of a component unit by drawing an analogy ency issues regarding naming in a DFS.
from (conventional) UNIX, where multiple After introducing the properties in this
disk partitions play the role of distributed context, we sketch approaches to naming
storage sites. There, an entire removable and discuss implementation techniques.
file system is a component unit, since a file
system must fit within a single disk parti- 2.1 Location Transparency
tion [Ritchie and Thompson 19741. In all and Independence
five systems, a component unit is a partial
This section discusses transparency in the
subtree of the UNIX hierarchy.
context of file names. First, two related
Before we proceed, we stress that the
distributed nature of a DFS is fundamental notions regarding name mappings in a DFS
to our view. This characteristic lays the need to be differentiated:
foundation for a scalable and fault-tolerant l Location Transparency. The name of a
system. Yet, for a distributed system to be file does not reveal any hint as to its
conveniently used, its underlying dispersed physical storage location.
structure and activity should be made l Location Independence. The name of a
transparent to users. We confine ourselves file need not be changed when the file’s
to discussing DFS designs in the context of physical storage location changes.
transparency, fault tolerance, and scalabil-
ity. The aim of this paper is to develop an Both definitions are relative to the dis-
understanding of these three concepts on cussed level of naming, since files have
the basis of the experience gained with different names at different levels (i.e.,
contemporary systems. user-level textual names, and system-level
numerical identifiers). A location-indepen-
2. NAMING AND TRANSPARENCY
dent naming scheme is a dynamic mapping,
since it can map the same file name to
Naming is a mapping between logical and different locations at two different in-
physical objects. Users deal with logical stances of time. Therefore, location inde-
data objects represented by file names, pendence is a stronger property than
whereas the system manipulates physical location transparency. Location indepen-
blocks of data stored on disk tracks. Usu- dence is often referred to as file migration
ally, a user refers to a file by a textual or file mobility. When referring to file mi-
name. The latter is mapped to a lower-level gration or mobility, one implicitly assumes
ACM Computing Surveys, Vol. 22, No. 4, December 1990
326 l E. Levy and A. Silberschatz
that the movement of files is totally trans- l Location independence separates the
parent to users. That is, files are migrated naming hierarchy from the storage de-
by the system without the users being vices hierarchy and the interserver struc-
aware of it. ture. By contrast, if only location
In practice, most of the current file sys- transparency is used (although names
tems (e.g., Locus, NFS, Sprite) provide a are transparent), one can easily expose
static, location-transparent mapping for the correspondence between component
user-level names. The notion of location units and machines. The machines are
independence is, however, irrelevant for configured in a pattern similar to the
these systems. Only Andrew and some ex- naming structure. This may restrict the
perimental file systems support location architecture of the system unnecessarily
independence and file mobility (e.g., Eden and conflict with other considerations. A
[Almes et al., 1983; Jessop et al. 19821). server in charge of a root directory is
Andrew supports file mobility mainly for an example for a structure dictated by
administrative purposes. A protocol pro- the naming hierarchy and contradicts
vides migration of Andrew’s component decentralizat,ion guidelines. An excellent
units upon explicit request without chang- example of separation of the service
ing the user-level or the low-level names of structure from the naming hierarchy can
the corresponding files (see Section 11.2 for be found in the design of the Grapevine
details). system [Birrel et al. 1982; Schroeder et
There are few other aspects that can al. 19841.
further differentiate and contrast location
The concept of file mobility deserves more
independence and location transparency:
attention and research. We envision future
DFS that supports location independence
l Divorcing data from location, as exhib- completely and exploits the flexibility that
ited by location independence, provides this property entails.
a better abstraction for files. Location-
independent files can be viewed as logical
data containers not attached to a specific 2.2 Naming Schemes
storage location. If only location trans- There are three main approaches to naming
parency is supported, however, the file schemes in a DFS [Barak et al. 19861. In
name still denotes a specific, though hid- the simplest approach, files are named by
den, set of physical disk blocks. some combination of their host name and
l Location transparency provides users local name, which guarantees a unique sys-
with a convenient way to share data. tem-wide name. In Ibis for instance, a
Users may share remote files by naming file is uniquely identified by the name
them in a location-transparent manner hostzlocal-name, where local name is a
as if they were local. Nevertheless, shar- UNIX-like path [Tichy and Ruan 19841.
ing the storage space is cumbersome, This naming scheme is neither location
since logical names are still statically at- transparent nor location independent.
tached to physical storage devices. Loca- Nevertheless, the same file operations can
tion independence promotes sharing the be used for both local and remote files; that
storage space itself, as well as sharing the is, at least the fundamental network trans-
data objects. When files can be mobilized, parency is provided. The structure of the
the overall, systemwide storage space DFS is a collection of isolated component
looks like a single, virtual resource. A units that are entire conventional file sys-
possible benefit of such a view is the tems. In this first approach, component
ability to balance the utilization of disks units remain isolated, although means are
across the system. Load balancing of the provided to refer to a remote file. We do
servers themselves is also made possible not consider this scheme any further in this
by this approach, since files can be mi- paper.
grated from heavily loaded servers to The second approach, popularized by
lightly loaded ones. Sun’s NFS, provides means for individual

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 327
machines to attach (or mount in UNIX 2.3.1 Pathname Translation
jargon) remote directories to their local
The mapping of textual names to low-level
name spaces. Once a remote directory is
identifiers is typically done by a recursive
attached locally, its files can be named in a
lookup procedure based on the one used in
location-transparent manner. The result-
conventional UNIX [Ritchie and Thomp-
ing name structure is versatile; usually it is
son 19741. We briefly review how this
a forest of UNIX trees, one for each ma-
procedure works in a DFS scenario by il-
chine, with some overlapping (i.e., shared)
lustrating the lookup of the textual name
subtrees. A prominent property of this
/a/b/c of Figure 1. The figure shows a par-
scheme is the fact that the shared name
tial name structure constructed from three
space may not be identical at all the ma-
component units using the third scheme
chines. Usually this is perceived as a serious
mentioned above. For simplicity, we as-
disadvantage; however, the scheme has the
sume that the location table is available to
potential for creating customized name
all the machines. Suppose that the lookup
spaces for individual machines.
is initiated by a client on machinel. First,
Total integration between the compo-
the root directory ‘1’ (whose low-level iden-
nent file systems is achieved using the third
tifier and hence its location on disk is
approach-a single global name structure
known in advance) is searched to find the
that spans all the files in the system. Con-
entry with the low-level identifier of a.
sequently, the same name space is visible to
Once the low-level identifier of a is found,
all clients. Ideally, the composed file system
the directory a itself can be fetched from
structure should be isomorphic to the struc-
disk. Now, b is looked for in this directory.
ture of a conventional file system. In prac-
Since b is remote, an indication that b
tice, however, there are many special files
belongs to cu2 is recorded in the entry of b
that make the ideal goal difficult to attain.
in the directory a. The component of the
(In UNIX, for example, I/O devices are
name looked up so far is stripped off and
treated as ordinary files and are repre-
the remainder (/b/c) is passed on to
sented in the directory Jdev; object code of
system programs reside in the directory machine2. On machine2, the lookup is con-
tinued and eventually machine3 is con-
/bin. These are special files specific to a
particular hardware setting.) Different var- tacted and the low-level identifier of /a/b/c
is returned to the client. All five systems
iations of this approach are examined in
mentioned in this paper use a variant of
the sections on UNIX United, Locus,
Sprite, and Andrew. this lookup procedure. Joining component
units together and recording the points
All important criterion for evaluating the
where they are joined (e.g., b is such a point
above naming structures is administrative
complexity. The most complex structure in the above example) is done by the mount
and most difficult to maintain is the NFS mechanism discussed below.
structure. The effects of a failed machine, There are few options to consider when
machine boundaries are crossed in the
or taking a machine off-line, are that some
arbitrary set of directories on different. course of a pat,hname traversal. We refer
again to the above example. Once
machines becomes unavailable. Likewise,
machine2 is contacted, it can look up b and
migrating files from one machine to an-
respond immediately to machinel. Alter-
other requires changes in the name spaces
natively, machine2 can initiate the contact
of all the affected machines. In addition, a
with machine3 on behalf of the client on
separate accreditation mechanism had to
machinel. This choice has ramifications on
be devised for controlling which machine is
fault tolerance that are discussed in Section
allowed to attach which directory to its
5.2. Among the surveyed systems, only in
name space.
UNIX United are lookups forwarded from
2.3 Implementation Techniques
machine to machine on behalf of the lookup
initiator. If machine2 responds immedi-
This section reviews commonly used tech- ately, it can either respond with the low-
niques related to naming. level identifier of b or send as a reply the

ACM Computing Surveys, Vol. 22, No. 4, December 1990


328 l E. Levy and A. Silberschatz

component unit server 7


cul machine1
cu2 machine2
cu3 machine3

Location Table

Figure 1. Lookup example.

ent.ire parent directory of b. In the former ond identifies the particular file within the
it is the server (machine2 in the example) unit. Variants with more parts are possible.
that performs the lookup, whereas in the The invariant of structured names is, how-
latter it is the client that initiates the ever, that individual parts of the name are
lookup that actually searches the directory. unique for all times only within the context
In case the server’s CPU is loaded, this of the rest of the parts. Uniqueness at all
choice is of consequence. In Andrew and times can be obtained by not reusing a
Locus, clients perform the lookups; in NFS name that is still used, or by allocating a
and Sprite the servers perform it. sufficient number of bits for the names
(this method is used in Andrew), or by using
2.3.2 Structured Identifiers a time stamp as one of the parts of the
name (as done in Apollo Domain [Leach et
Implementing transparent naming requires al. 19821).
the provision of the mapping of a file name To enhance the availability of the crucial
to its location. Keeping this mapping man- name to location mapping information,
ageable calls for aggregating sets of files methods such as replicating it or caching
into component units and providing the parts of it locally by clients are used. As
mapping on a component unit basis rather was noted, location independence means
than on a single file basis. Typically, struc- that the mapping changes in time and,
tured identifiers are used for this aggrega- hence, replicating the mapping makes up-
tion. These are bit strings that usually have dating the information consistently a com-
two parts. The first part identifies the com- plicated matter. Structured identifiers are
ponent unit to which file belongs; the sec- location independent; they do not mention

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems 8 329
servers’ locations at all. Hence, these iden- cache location information from servers
tifiers can be replicated and cached freely and treat this information as hints (see
without being invalidated by migration of Section 11.4). Sprite uses an effective form
component units. A smaller, second level of of hints called prefix tables and resorts to
mapping that maps component units to broadcasting when the hint is wrong (see
locations is the only information that does Section 10.2). The location mechanism of
change when files migrate. The usage of Apollo Domain is based on hints and heu-
the techniques of aggregation of files into ristics [Leach et al. 19821. The Grapevine
component units and lcwer-level, location- mail system counts on hints to locate
independent file identifiers is exempli- mailboxes of mail recipients [Birrel et al.
fied in Andrew (Section 11) and Locus 19821.
(Section 8).
We illustrate the above techniques with
the example in Figure 1. Suppose the path- 2.3.4 Mount Mechanism
name /a/b/c is translated to the structured, Joining remote file systems to create a
low-level identifier <cu3, ll>, where cu3 global name structure is often done by the
denotes that file’s component unit and 11 mount mechanism. In conventional UNIX,
identifies it in that unit. The only place the mount mechanism is used to join to-
where machine locations are recorded is in gether several self-contained file systems to
the location table. Hence, the correspon- form a single hierarchical name space
dence between /a/b/c and <cu3, ll> is not [Quarterman et al. 1985; R.itchie and
invalidated once cu3 is migrated to Thompson 19741. A mount operation binds
machine2; only the location table should be the root of one file system to a directory of
updated. another file system. The former file system
hides the subtree descending from the
2.3.3 Hints mounted-over directory and looks like an
A technique often used for location map- integral subtree of the latter file system.
ping in a DFS is that of hints [Lampson The directory that glues together the two
1983; Terry 19871. A hint is a piece of file systems is called a mount point. All
information that speeds up performance if mount operations are recorded by the op-
it is correct and does not cause any se- erating system kernel in a mount table. This
mantically negative effects if it is incorrect. table is used to redirect name lookups to
In essence, a hint improves performance the appropriate file systems. The same se-
similarly to cached information. A hint may mantics and mechanisms are used to mount
be wrong, however; therefore, its correct- a remote file system over a local one. Once
ness must be validated upon use. To illus- the mount is complete, files in the remote
trate how location information is treated as file system can be accessed locally as if they
hints, assume there is a location server that were ordinary descendants of the mount
always reflects the correct and complete point directory. The mount mechanism is
mapping of files to locations. Also assume used with slight variations in Locus, NFS,
that clients cache parts of this mapping Sprite, and Andrew. Section 9.2.1 presents
locally. The cached location information is a detailed example of the mount operation.
treated as a hint. If a file is found using the
hint, a substantial performance gain is ob-
3. SEMANTICS OF SHARING
tained. On the other hand, if the hint was
invalidated because the file had been mi- The semantics of sharing are important
grated, the client’s lookup would fail. Con- criteria for evaluating any file system that
sequently, the client must resort to the allows multiple clients to share files. It is a
more expensive procedure of querying the characterization of the system that speci-
location server; but, still, no semantically fies the effects of multiple clients accessing
negative effects are caused. Examples of a shared file simultaneously. In partic-
using hints abound: Clients in Andrew ular, these semantics should specify when

ACM Computing Surveys, Vol. 22, No. 4, December 1990


330 l E. L.evy and A. Silberschatz
modifications of data by a client are ob- tifact of UNIX and is needed primarily for
servable, if at all, by remote clients. compatibility of distributed UNIX systems
For the following discussion we need to with conventional UNIX software. Most
assume that a series of file accesses (i.e., DFSs try to emulate these semantics to
Reads and Writes) attempted by a client to some extent (e.g., Locus, Sprite) mainly
the same file are always enclosed between because of compatibility reasons.
the Open and Close operations. We denote
such a series of accesses as a file session. 3.2 Session Semantics
It should be realized that applications
that use the file system to store data and Writes to an open file are visible imme-
pose constraints on concurrent accesses in diately to local clients but are invisible to
order to guarantee the semantic consis- remote clients who have the same file
tency of their data (i.e., database applica- open simultaneously.
tions) should use special means (e.g., locks) Once a file is closed, the changes made to
for this purpose and not rely on the under- it are visible only in later starting ses-
lying semantics of sharing provided by the sions. Already open instances of the file
file system. do not reflect these changes.
To illustrate the concept, we sketch sev- According to these semantics, a file may
eral examples of semantics of sharing men- be temporarily associated with several (pos-
tioned in this paper. We outline the gist of sibly different) images at the same time.
the semantics and not the whole detail. Consequently, multiple clients are allowed
to perform both Read and Write accesses
3.1 UNIX Semantics concurrently on their image of the file,
without being delayed. Observe that when
Every Read of a file sees the effects of all a file is closed, all remote active sessions
previous Writes performed on that file in are actually using a stale copy of the file.
the DFS. In particular, Writes to an open Here, it is evident that application pro-
file by a client are visible immediately by grams that care about the serialization of
other (possibly remote) clients who have accesses (e.g., a distributed database appli-
this file open at the same time. cation) should coordinate their accesses
It is possible for clients to share the explicitly and not rely on these semantics,
pointer of current location into the file.
Thus, the advancing of the pointer by 3.3 Immutable Shared Files Semantics
one client affects all sharing clients.
-4 different, quite unique approach is that
Consider a sequence interleaving all the of immutable shared files [Schroeder et al.
accesses to the same file regardless of the 19851. Once a file is declared as shared by
identity of the issuing client. Enforcing its creator, it cannot be modified any more.
the above semantics guarantees that each An immutable file has two important prop-
successive access sees the effects of the ones erties: Its name may not be reused, and its
that precede it in that sequence. In a file contents may not be altered. Thus, the
system context, such an interleaving can be name of an immutable file signifies the
totally arbitrary, since, in contrast to da- fixed contents of the file, not the file as
tabase management systems, sequences of a container for variable information. The
accesses are not defined as transactions. implementation of these semantics in a dis-
These semantics lend themselves to an im- tributed system is simple since the sharing
plementation where a file is associated with is in read-only mode.
a single physical image that serves all ac-
cesses in some serial order (which is the 3.4 Transaction-Like Semantics
order captured in the above sequence).
Contention for this single image results in Identifying a file session with a transaction
clients being delayed. The sharing of the yields the following, familiar semantics:
location pointer mentioned above is an ar- The effects of file sessions on a file and

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 331

their output are equivalent to the effect and the clients. Every access is handled by
output of executing the same sessions in the server and results in network traffic.
some serial order. Locking a file for the For example, a Read corresponds to a
duration of a session implements these request message sent to the server and a
semantics. Refer to the rich literature on reply to the client with the requested
database management systems to under- data. A similar notion called Remote
stand the concepts of transactions and Open is defined in Howard et al. [1988].
locking [Bernstein et al. 19871. In the Cam- l Caching. If the data needed to satisfy the
bridge File Server, the beginning and end access request are not present locally, a
of a transaction are implicit in the Open copy of those data is brought from the
file, Close file operations, and transactions server to the client. Usually the amount
can involve only one file [Needham and of data brought over is much larger than
Herbert 19821. Thus, a file session in that the data actually requested (e.g., whole
system is actually a transaction. files or pages versus a few blocks). Ac-
Variants of UNIX and (to a lesser de- cesses are performed on the cached copy
gree) session semantics are the most in the client side. The idea is to retain
commonly used policies. An important recently accessed disk blocks in cache
trade-off emerges when evaluating these so repeated accesses to the same infor-
two extremes of sharing semantics. Sim- mation can be handled locally, without
plicity of a distributed implementation is additional network traffic. Caching
traded for the strength of the semantics’ performs best when the stream of file
guarantee. UNIX semantics guarantee the accesses exhibits locality of reference. A
strong effect of making all accesses see the replacement policy (e.g., Least Recently
same version of the file, thereby ensuring Used) is used to keep the cache size
that every access is affected by all previous bounded. There is no direct correspon-
ones. On the other hand, session semantics dence between accesses and traffic to
do not guarantee much when a file is ac- the server. Files are still identified, with
cessed concurrently, since accesses at dif- one master copy residing at the server
ferent machines may observe different machine, but copies of (parts of) the file
versions of the accessed file. The ramifica- are scattered in different caches. When a
tions on the ease of implementation are cached copy is modified, the changes need
discussed in the next section. to be reflected on the master copy and,
depending on the relevant sharing se-
mantics, on any other cached copies.
4. REMOTE-ACCESS METHODS
Therefore, Write accesses may incur sub-
Consider a client process that requests to stantial overhead. The problem of keep-
access (i.e., Read or Write) a remote file. ing the cached copies consistent with the
Assuming the server storing the file was master file is referred to as the cache
located by the naming scheme, the actual consistency problem [Smith 19821.
data transfer to satisfy the client’s request
It should be realized that there is a direct
for the remote access should take place.
analogy between disk access methods in
There are two complementary methods for
conventional file systems and remote ac-
handling this type of data transfer.
cess methods in DFSs. A pure remote serv-
l Remote Service. Requests for accesses ice method is analogous to performing a
are delivered to the server. The server disk access for each and every access re-
machine performs the accesses, and their quest. Similarly, a caching scheme in a DFS
results are forwarded back to the client. is an extension of caching or buffering tech-
There is a direct correspondence between niques in conventional file systems (e.g.,
accesses and traffic to and from the buffering block I/O in UNIX [McKusick et
server. Access requests are translated to al. 19841). In conventional file systems, the
messages for the servers, and server re- rationale behind caching is to reduce disk
plies are packed as messages sent back to I/O, whereas in DFSs the goal is to reduce

ACM Computing Surveys, Vol. 22, No. 4, December 1990


332 . E. Levy and A. Silberschatz
network traffic. For these reasons, a pure The choices for these decisions are inter-
remote service method is not practical. Im- twined and related to the selected sharing
plementations must incorporate some form semantics.
of caching for performance enhancement.
Many implementations can be thought of 4.1.1 Cache Unit Size
as a hybrid of caching and remote service.
In Locus and NFS, for instance, the imple- The granularity of the cached data can vary
mentation is based on remote service but is from parts of a file to an entire file. Usually,
augmented with caching for performance more data are cached than needed to satisfy
(see Sections 8.3, 8.4, and 9.3.3). On the a single access, so many accesses can be
other hand, Sprite’s implementation is served by the cached data. An early version
based on caching, but under certain circum- of Andrew caches entire files. Currently,
stances a remote service method is adopted Andrew still performs caching in big
(see Section 10.3). Thus, when we evalu- chunks (64Kb). The rest of the systems
ate the two methods we actually evaluate support caching individual blocks driven by
to what degree one method should be clients’ demand, where a block is the unit
emphasized over the other. of transfer between disk and main memory
An interesting study of the performance buffers (see sample sizes below). Increasing
aspects of the remote access problem can the caching unit increases the likelihood
be found in Cheriton and Zwaenepoel that data for the next access will be found
[ 19831. This paper evaluates to what extent locally (i.e., the hit ratio is increased); on
remote access (using the simplest remote the other hand, the time required for the
service paradigm) is more expensive than data transfer and the potential for consis-
local access. tency problems are increased, too. Selecting
The remote service method is straight- the unit of caching involves parameters
forward and does not require further expla- such as the network transfer unit and the
nation. Thus, the following material is Remote Procedure Call (RPC) protocol
primarily concerned with the method of service unit (in case an RPC protocol is
caching. used) [Birrel and Nelson 19841. The net-
work transfer unit is relatively small (e.g.,
Ethernet packets are about 1.5Kb), so big
4.1 Designing a Caching Scheme units of cached data need to be disassem-
bled for delivery and reassembled upon
The following discussion pertains to a (file
reception [Welch 19861.
data) caching scheme between a client’s
Typically, block-caching schemes use a
cache and a server. The latter is viewed as
technique called read-ahead. This tech-
a uniform entity and its main memory and
nique is useful when sequentially reading a
disk are not differentiated. Thus, we ab-
large file. Blocks are read from the server
stract the traditional caching scheme on
disk and buffered on both the server and
the server side, between its own cache and
client sides before they are actually needed
disk.
in order to speed up the reading.
A caching scheme in a DFS should
One advantage of a large caching unit is
address the following design decisions
reduced network overhead. Recall that run-
[Nelson et al. 19881:
ning communication protocols accounts for
The granularity of cached data. a substantial portion of this overhead.
Transferring data in bulks amortizes the
The location of the client’s cache (main protocol cost over many transfer units. At
memory or local disk). the sender side, one context switch (to load
How to propagate modifications of the communication software) suffices to
cached copies. format and transmit multiple packets. At
How to determine if a client’s cached data the receiver side, there is no need to ac-
are consistent. knowledge each packet individually.

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 333
Block size and the total cache size are action of sending dirty blocks to be written
important for block-caching schemes. In on the master copy.
UNIX-like systems, common block sizes The policy used to flush dirty blocks back
are 4Kb or 8Kb. For large caches (more to the server’s master copy has a critical
than lMb), large block sizes (more than effect on the system’s performance and re-
8Kb) are beneficial since the advantages liability. (In this section we assume caches
of large caching unit size are dominant are held in main memories.) The simplest
[Lazowska et al. 1986; Ousterhout et al. policy is to write data through to the serv-
19851. For smaller caches, large block sizes er’s disk as soon as it is written to any
are less beneficial because they result in cache. The advantage of the write-through
fewer blocks in the cache and most of method is its reliability: Little information
the cache space is wasted due to internal is lost when a client crashes. This policy
fragmentation. requires, however, that each Write access
waits until the information is sent to the
4.1.2 Cache Location server, which results in poor Write perfor-
Regarding the second decision, disk caches mance. Caching with write-through is
have one clear advantage-reliability. equivalent to using remote service for Write
Modifications to cached data are lost in a accesses and exploiting caching only for
crash if the cache is kept in volatile mem- Read accesses.
ory. Moreover, if the cached data are kept An alternate write policy is to delay up-
on disk, the data are still there during re- dates to the master copy. Modifications are
covery and there is no need to fetch them written to the cache and then written
again. On the other hand, main-memory through to the server later. This policy has
caches have several advantages. First, main two advantages over write-through. First,
memory caches permit workstations to be since writes are to the cache, Write accesses
diskless. Second, data can be accessed more complete more quickly. Second, data may
quickly from a cache in main memory than be deleted before they are written back, in
from one on a disk. Third, the server caches which case they need never be written at
(used to speed up disk I/O) will be in main all. Unfortunately, delayed-write schemes
memory regardless of where client caches introduce reliability problems, since un-
are located; by using main-memory caches written data will be lost whenever a client
on clients, too, it is possible to build a single crashes.
caching mechanism for use by both servers There are several variations of the
and clients (as it is done in Sprite). It turns delayed-write policy that differ in when to
out that the two cache locations emphasize flush dirty blocks to the server. One alter-
different functionality. Main-memory native is to flush a block when it is about
caches emphasize reduced access time; to be ejected from the client’s cache. This
disk caches emphasize increased reliability option can result in good performance, but
and autonomy of single machines. Notice some blocks can reside in the client’s cache
that the current technology trend is larger for a long time before they are written back
and cheaper memories. With large main- to the server [Ousterhout et al. 19851. A
memory caches, and hence high hit ratios, compromise between the latter alternative
the achieved performance speed up is pre- and the write-through policy is to scan the
dicted to outweigh the advantages of disk cache periodically, at regular intervals, and
caches. flush blocks that have been modified since
the last scan. Sprite uses this policy with a
4.1.3 Modification Policy 30-second interval.
Yet another variation on delayed-write,
In the sequel, we use the term dirty block called write-on-close, is to write data back
to denote a block of data that has been to the server when the file is closed. In
modified by a client. In the context of cach- cases of files open for very short periods or
ing, we use the term to flush to denote the rarely modified, this policy does not signif-

ACM Computing Surveys, Vol. 22, No. 4, December 1990


334 l E. Levy and A. Silberschatz

icantly reduce network traffic. In addition, tacts the server and checks whether the
the writ.e-on-close policy requires the clos- local data are consistent with the master
ing process to delay while the file is written copy. The frequency of the validity check
through, which reduces the performance is the crux of this approach and deter-
advantages of delayed-writes. The per- mines the resulting sharing semantics. It
formance advantages of this policy over can range from a check before every sin-
delayed-write with more frequent flushing gle access to a check only on first access
are apparent for files that are both open for to a file (on file Open). Every access that
long periods and modified frequently. is coupled with a validity check is de-
As a reference, we present data regarding layed, compared with an access served
the utility of caching in UNIX 4.2 BSD. immediately by the cache. Alternatively,
UNIX 4.2 BSD uses a cache of about 400Kb a check can be initiated every fixed inter-
holding different size blocks (the most com- val of time. Usually the validity check
mon size is 4Kb). A delayed-write policy involves comparing file header informa-
with 30-second intervals is used. A miss tion (e.g., time stamp of the last update
ratio (ratio of the number of real disk I/O maintained as i-node information in
to logical disk accesses) of 15 percent is UNIX). Depending on its frequency, this
reported in McKusick et al. [1984], and of kind of validity check can cause severe
50 percent in Ousterhout et al. [1985]. The network traffic, as well as consume pre-
latter paper also provides the following sta- cious server CPU time. This phenome-
tistics, which were obtained by simulations non was the cause for Andrew designers
on UNIX: A 4Mb cache of 4Kb blocks to withdraw from this approach (Howard
eliminates between 65 and 90 percent of all et al. [ 19881 provide detailed performance
disk accesses for file data. A write-through data on this issue).
policy resulted in the highest miss ratio. l Server-initiated approach. The server
Delayed-write policy with flushing when records for each client the (parts of) files
the block is ejected from cache had the the client caches. Maintaining informa-
lowest miss ratio. tion on clients has significant fault tol-
There is a tight relation between the erance implications (see Section 5.1).
modification policy and semantics sharing. When the server detects a potential for
Write-on-close is suitable for session se- inconsistency, it must now react. A po-
mantics. By contrast, using any delayed- tential for inconsistency occurs when a
write policy, when situations of files that file is cached in conflicting modes by two
are updated concurrently occur frequently different clients (i.e., at least one of the
in conjunction with UNIX semantics, is not clients specified a Write mode). If session
reasonable and will result in long delays semantics are implemented, whenever a
and complex mechanisms. A write-through server receives a request to close a file
policy is more suitable for UNIX semantics that has been modified, it should react by
under such circumstances. notifying the clients to discard their
cached data and consider it invalid.
4.1.4 Cache Validation Clients having this file open at that time,
discard their copy when the current ses-
A client is faced with the problem of decid-
sion is over. Other clients discard their
ing whether or not its locally cached copy
copy at once. Under session semantics,
of the data is consistent with the master
the server need not be informed about
copy. If the client determines that its
Opens of already cached files. The server
cached data is out of date, accesses can no
is informed about the Close of a writing
longer be served by that cached data. An
session, however. On the other hand, if a
up-to-date copy of the data must be brought
more restrictive sharing semantics is im-
over. There are basically two approaches to
plemented, like UNIX semantics, the
verifying the validity of cached data:
server must be more involved. The server
l Client-initiated approach. The client in- must be notified whenever a file is
itiates a validity check in which it con- opened, and the intended mode (Read or

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 335

Write) must be indicated. Assuming such modes. In addition, once a cached copy is
notification, the server can act when it modified, the changes need to be propa-
detects a file that is opened simultane- gated immediately to the rest of the
ously in conflicting modes by disabling cached copies. Frequent Writes can gen-
caching for that particular file (as done erate tremendous network traffic and
in Sprite). Disabling caching results in cause long delays before requests are sat-
switching to a remote service mode of isfied. This is why implementations (e.g.,
operation. Sprite) disable caching altogether and re-
A problem with the server-initiated ap- sort to remote service once a file is con-
proach is that it violates the traditional currently open in conflicting modes.
client-server model, where clients initiate Observe that such an approach implies
activities by requesting service. Such vi- some form of a server-initiated validation
olation can result in irregular and com- scheme, where the server makes a note of
plex code for both clients and servers. all Open calls. As was stated, UNIX se-
mantics lend themselves to an implemen-
In summary, the choice is longer accesses tation where a file is associated with a
and greater server load using the former single physical image. A remote service
method versus the fact that the server approach, where all requests are directed
maintains information on its clients using and served by a single server, fits nicely
the latter. with these semantics.
l The immutable shared files semantics
4.2 Cache Consistency were invented for a whole file caching
Before delving into the evaluation and com- scheme [Schroeder et al. 19851. With
parison of remote service and caching, we these semantics, the cache consistency
relate these remote access methods to the problem vanishes totally.
examples of sharing semantics introduced l Transactions-like semantics can be im-
in Section 3. plemented in a straightforward manner
using locking, when all the requests for
l Session semantics are a perfect match for the same file are served by the same
caching entire files. Read and Write server on the same machine as done in
accesses within a session can be handled remote service.
by the cached copy, since the file can be
associated with different images accord-
ing to the semantics. The cache consis- 4.3 Comparison of Caching
tency problem diminishes to propagating and Remote Service
the modifications performed in a session
to the master copy at the end of a session. Essentially, the choice between caching and
This model is quite attractive since it has remote service is a choice between potential
simple implementation. Observe that for improved performance and simplicity.
coupling these semantics with caching We evaluate the trade-off by listing the
parts of files may complicate matters, merits and demerits of the two methods.
since a session is supposed to read the l When caching is used, a substantial
image of the entire file that corresponds amount of the remote accesses can be
to the time it was opened. handled efficiently by the local cache.
l A distributed implementation of UNIX Capitalizing on locality in file access pat-
semantics using caching has serious con- terns makes caching even more attrac-
sequences. The implementation must tive. Ramifications can be performance
guarantee that at all times only one client transparency: Most of the remote ac-
is allowed to write to any of the cached cesses will be served as fast as local ones.
copies of the same file. A distributed con- Consequently, server load and network
flict resolution scheme must be used in traffic are reduced, and the potential for
order to arbitrate among clients wishing scalability is enhanced. By contrast,
to access the same file in conflicting when using the remote service method,

ACM Computing Surveys, Vol. 22, No. 4, December 1990


336 . E. Levy and A. Silberschatz
each remote access is handled across the intermachine interface mirrors the local
network. The penalty in network traffic, client-file system interface.
server load, and performance is obvious.
l Total network overhead in transmitting 5. FAULT TOLERANCE ISSUES
big chunks of data, as done in caching, is
Fault tolerance is an important and broad
lower than when series of short responses
subject in the context of DFS. In this
to specific requests are transmitted (as in
section we focus on the following fault
the remote service method).
tolerance issues. In Section 5.1 we examine
Disk access routines on the server may two service paradigms in the context of
be better optimized if it is known that faults occurring while servicing a client. In
requests are always for large, contiguous Section 5.2 we define the concept of avail-
segments of data rather than for random ability and discuss how to increase the
disk blocks. This point and the previous availability of files. In Section 5.3 we review
one indicate the merits of transferring file replication as another means for en-
data in bulk, as done in Andrew. hancing availability.
The cache consistency problem is the
major drawback to caching. In access 5.1 Stateful Versus Stateless Service
patterns that exhibit infrequent writes,
When a server holds on to information on
caching is superior. When writes are fre-
its clients between servicing their requests,
quent, however, the mechanisms used to
we say the server is stateful. Conversely,
overcome the consistency problem incur
when the server does not maintain any
substantial overhead in terms of perfor-
information on a client once it finished
mance, network traffic, and server load.
servicing its request, we say the server is
It is hard to emulate the sharing seman- stateless.
tics of a centralized system in a system The typical scenario of a stateful file
using caching as its remote access service is as follows. A client must perform
method. The problem is the cache consis- an Open on a file before accessing it. The
tency; namely, the fact that accesses are server fetches some information about the
directed to distributed copies, not to a file from its disk, stores it in its memory,
central data object. Observe that the two and gives the client some connection iden-
caching-oriented semantics, session se- tifier that is unique to the client and the
mantics and immutable shared files open file. (In UNIX terms, the server
semantics, are not restrictive and do not fetches the i-node and gives the client a file
enforce serializability. On the other hand, descriptor, which serves as an index to an
when using remote service, the server in-core table of i-nodes.) This identifier is
serializes all accesses and, hence, is able used by the client for subsequent accesses
to implement any centralized sharing until the session ends. Typically, the iden-
semantics. tifier serves as an index into in-memory
To use caching and benefit from its mer- table that records relevant information the
its, clients must have either local disks server needs to function properly (e.g.,
or large main memories. Clients without timestamp of last modification of the cor-
disks can use remote-service methods responding file and its access rights). A
without any problems. stateful service is characterized by a virtual
Since, for caching, data are transferred circuit between the client and the server
en masse between the server and client, during a session. The connection identifier
and not in response to the specific needs embodies this virtual circuit. Either upon
of a file operation, the lower interma- closing the file or by a garbage collection
chine interface is quite different from the mechanism, the server must reclaim the
upper client interface. The remote ser- main-memory space used by clients that
vice paradigm, on the other hand, is just are no longer active.
an extension of the local file system in- The advantage of stateful service is per-
terface across the network. Thus, the formance. File information is cached in

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 337
main memory and can be easily accessed DFS. First, since each request identifies the
using the connection identifier, thereby target file, a uniform, systemwide, low-level
saving disk accesses. The key point regard- naming is advised. Translating remote to
ing fault tolerance in a stateful service local names for each request would imply
approach is t,he main-memory information even slower processing of the requests. Sec-
kept by the server on its clients. ond, since clients retransmit requests for
A stateless server avoids this state infor- files operations, these operations must be
mation by making each request self- idempotent. An idempotent operation has
contained. That is, each request identifies the same effect and returns the same output
the file and position in the file (for Read if executed several times consecutively.
and Write accesses) in full. The server need Self-contained Read and Write accesses are
not keep a table of open files in main mem- idempotent, since they use an absolute byte
ory, although this is usually done for effi- count to indicate the position within a file
ciency reasons. Moreover, there is no need and do not rely on an incremental offset
to establish and terminate a connection by (as done in UNIX Read and Write system
Open and Close operations. They are to- calls). Care must be taken when imple-
tally redundant, since each file operation menting destructive operations (such as
stands on its own and is not considered as Delete a file) to make them idempotent too.
part of a session. In some environments a stateful service
The distinction between stateful and is a necessity. If a Wide Area Network
stateless service becomes evident when (WAN) or Internetworks is used, it is pos-
considering the effects of a crash during a sible that messages are not received in the
service activity. A stateful server loses all order they were sent. A stateful, virtual-
its volatile state in a crash. A graceful re- circuit-oriented service would be preferable
covery of such a server involves restoring in such a case, since by the maintained
this state, usually by a recovery protocol state it is possible to order the messages
based on a dialog with clients. Less graceful correctly. Also observe that if the server
recovery implies abortion of the operations uses the server-initiated method for cache
that were underway when the crash oc- validation, it cannot provide stateless serv-
curred. A different problem is caused by ice since it maintains a record of which files
client failures. The server needs to become are cached by which clients. On the other
aware of such failures in order to reclaim hand, it is easier to build a stateless service
space allocated to record the state of than a stateful service on top of a datagram
crashed clients. These phenomena are communication protocol [Postel 19801.
sometimes referred to as orphan detection The way UNIX uses file descriptors and
and elimination. implicit offsets is inherently stateful. Serv-
A stateless server avoids the above prob- ers must maintain tables to map the file
lems, since a newly reincarnated server can descriptors to i-nodes and store the current
respond to a self-contained request without offset within a file. This is why NFS, which
difficulty. Therefore, the effects of server uses a stateless service, does not use file
failures and recovery are almost not notice- descriptors and includes an explicit offset
able. From a client’s point of view, there is in every access (see Section 9.2.2).
no difference between a slow server and a
recovering server. The client keeps retrans- 5.2 Improving Availability
mitting its request if it gets no response.
Regarding client failures, no obsolete state Svobodova [1984] defines two file proper-
needs to be cleaned up on the server side. ties in the context of fault tolerance: “A file
The penalty for using the robust stateless is recoverable if is possible to revert it to an
service is longer request messages and earlier, consistent state when an operation
slower processing of requests, since there is on the file fails or is aborted by the client.
no in-core information to speed the pro- A file is called robust if it is guaranteed to
cessing. In addition, stateless service im- survive crashes of the storage device and
poses other constraints on the design of the decays of the storage medium.” A robust

ACM Computing Surveys, Vol. 22, No. 4, December 1990


338 l E. Levy and A. Silberschatz
file is not necessarily recoverable and vice client and the server machines. Identifying
versa. Different techniques must be used the server that stores the file and establish-
to implement these two distinct concepts. ing the client-server connection is more
Recoverable files are realized by atomic problematic. A file location mechanism is
update techniques. (We do not give account an important factor in determining the
of atomic updates techniques in this paper.) availability of files. Traditionally, locating
Robust files are implemented by redun- a file is done by a pathname traversal,
dancy techniques such as mirrored files and which in a DFS may cross machine bound-
stable storage [ Lampson 19811. aries several times and hence involve more
It is necessary to consider the additional than two machines (see Section 2.3.1). In
criterion of auailability. A file is called avail- principle, most systems (e.g., Locus, NFS,
able if it can be accessed whenever needed, Andrew) approach the problem by requir-
despite machine and storage device crashes ing that each component (i.e., directory) in
and communication faults. Availability is the pathname would be looked up directly
often confused with robustness, probably by the client. Therefore, when machine
because they both can be implemented by boundaries are crossed, the server in the
redundancy techniques. A robust file is client-server pair changes, but the client
guaranteed to survive failures, but it may remains the same. In UNIX United, par-
not be available until the faulty component tially because of routing concerns, this
has recovered. Availability is a fragile and client-server model is not preserved in the
unstable property. First, it is temporal; pathname traversal. Instead, the pathname
availability varies as the system’s state traversal request is forwarded from ma-
changes. Also, it is relative to a client; for chine to machine along the pathname,
one client a file may be available, whereas without involving the client machine each
for another client on a different machine, time.
the same file may be unavailable. Observe that if a file is located by path-
Replicating files enhances their availa- name traversal, the availability of a file
bility (see Section 5.3); however, merely depends on the availability of all the direc-
replicating file is not sufficient. There are tories in its pathname. A situation can arise
some principles destined to ensure in- whereby a file might be available to reading
creased availability of the files described and writing clients, but it cannot be located
below. by new clients since a directory in its path-
The number of machines involved in a name is unavailable. Replicating top-level
file operation should be minimal, since the directories can partially rectify the prob-
probability of failure grows with the num- lem, and is indeed used in Locus to increase
ber of involved parties. Most systems ad- the availability of files.
here to the client-server pair for all file Caching directory information can both
operations. (This refers to a LAN environ- speed up the pathname traversal and avoid
ment, where no routing is needed.) Locus the problem of unavailable directories in
makes an exception, since its service model the pathname (i.e., if caching occurs before
involves a triple: a client, a server, and a the directory in the pathname becomes un-
Centralized Synchronization site (CSS). available). Andrew and NFS use this tech-
The CSS is involved only in Open and nique. Sprite uses a better mechanism for
Close operations; but if the CSS cannot be quick and reliable pathname traversal. In
reached by a client, the file is not available Sprite, machines maintain prefix tables
to that particular client. In general, having that map prefixes of pathnames to the serv-
more than two machines involved in a file ers that store the corresponding component
operation can cause bizarre situations in units. Once a file in some component unit
which a file is available to some but not all is open, all subsequent Opens of files within
clients. that same unit address the right server
Once a file has been located there is no directly, without intermediate lookups at
reason to involve machines other than the other servers. This mechanism is faster and

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 339

guarantees better availability. (For com- must be preserved when accesses to replicas
plete description of the prefix table mech- are viewed as virtual accesses to their logi-
anism refer to Section 10.2.) cal files. The analogous database term is
One-Copy Serializability [Bernstein et al.
19871. Davidson et al. [1985] survey ap-
5.3 File Replication
proaches to replication for database sys-
Replication of files is a useful redundancy tems, where consistency considerations are
for improving availability. We focus on rep- of major importance. If consistency is not
lication of files on different machines of primary importance, it can be sacrificed
rather than replication on different media for availability and performance. This is an
on the same machine (such as mirrored incarnation of a fundamental trade-off in
disks [Lampson 19811). Multimachine the area of fault tolerance. The choice is
replication can benefit performance too, between preserving consistency at all costs,
since selecting a nearby replica to serve thereby creating a potential for indefinite
an access request. results in shorter service blocking, or sacrificing consistency under
time. some (we hope rare) circumstance of cat-
The basic requirement from a replication astrophic failures for the sake of guaran-
scheme is that different replicas of the same teed progress. We illustrate this trade-off
file reside on failure-independent ma- by considering (in a conceptual manner)
chines. That is, the availability of one rep- the problem of updating a set of replicas of
lica is not affected by the availability of the the same file. The atomicity of such an
rest of the replicas. This obvious require- update is a desirable property; that is, a
ment implies that replicat,ion management situation in which both updated and not
is inherently a location-dependent activity. updated replicas serve accesses should be
Provisions for placing a replica on a partic- prevented. The only way to guarantee the
ular machine must be available. atomicity of such an update is by using a
It, is desirable to hide the details of rep- commit protocol (e.g., Two-phase commit),
lication from users. It is the task of the which can lead to indefinite blocking in the
naming scheme to map a replicated file face of machine and network failures
name to a particular replica. The existence [Bernstein et al. 19871. On the other hand,
of replicas should be invisible to higher if only the available replicas are updated,
levels. At some level, however, the replicas progress is guaranteed; stale replicas,
must be distinguished from one another by however, are present.
having different lower level names. This In most cases, the consistency of file data
can be accomplished by first mapping a file cannot be compromised, and hence the
name to an entity that is abie to differen- price paid for increased availability by
tiate the replicas (as done in Locus). An- replication is a complicated update prop-
other t,ransparency issue is providing agation protocol. One case in which consis-
replication control at higher levels. Repli- tency can be traded for performance, as
cation control includes determining the de- well as availability, is replication of the
gree of replication and placement of location hints discussed in Section 2.3.2.
replicas. Under certain circumstances, it is Since hints are validated upon use, their
desirable to expose these details to users. replication does not require maintaining
Locus, for instance, provides users and sys- their consistency. When a location hint is
tem administrators with mechanism to correct, it results in quick location of the
control the rephcation scheme. corresponding file without relying on a lo-
The main problem associated with rep- cation server. Among the surveyed systems,
licas is their update. From a user’s point of Locus uses replication extensively and sac-
view, replicas of a file denote the same rifices consistency in a partitioned environ-
logical entity; thus, an update to any replica ment for the sake of availability of files for
must be reflect,ed on all other replicas. More both Read and Write accesses (see Section
precisely, the relevant sharing semantics 8.5 for details).

ACM Computing Surveys, Vol. 22, No. 4, December 1990


340 l E. Levy and A. Silberschatz
Facing the problems associated with 6.1 we discuss several designs that pose
maintaining the consistency of replicas, a problems and propose possible solutions,
popular compromise is read-only replica- all in the context of scalability. In Section
tion. Files known to be frequently read and 6.2 we describe an implementation tech-
rarely modified are replicated using this nique, Light Weight Processes, essential
restricted variant of replication. Usually, for high-performance and scalable designs.
only one primary replica can be modified,
and the propagation of the updates involves
6.1 Guidelines by Negative Examples
either taking the file off line or using some
costly procedure that guarantees atomicity Barak and Kornatzky [1987] list several
of the updates. Files containing the object principles for designing very large-scale
code of system programs are good candi- systems. The first is called Bounded
dates for this kind of replication, as are Resources: “The service demand from any
system data files (e.g., location databases component of the system should be
and user registries). bounded by a constant. This constant is
As an illustration of the concepts dis- independent of the number of nodes in the
cussed above, we describe the replication system.” Any server whose load is propor-
scheme in Ibis, which is quite unique [Tichy tional to the size of the system is destined
and Ruan 19841. Ibis uses a variation of the to become clogged once the system grows
primary copy approach. The domain of the beyond a certain size. Adding more re-
name mapping is a pair: primary replica sources will not alleviate the problem. The
identifier and local replica identifier, if capacity of this server simply limits the
there is one. (If there is no replica locally, growth of the system. This is why the CSS
a special value is returned.) Thus, the map- of Locus is not a scalable design. In Locus,
ping is relative to a machine. If the local every filegroup (the Locus component unit,
replica is the primary one, the pair contains which is equivalent to a UNIX removable
two identical identifiers. Ibis supports file system) is assigned a CSS, whose re-
demand replication, which is an automatic sponsibility it is to synchronize accesses to
replication control policy (similar to whole- files in that filegroup. Every Open request
file caching). Demand replication means to a file within that filegroup must go
that reading a nonlocal replica causes it to through this machine. Beyond a certain
be cached locally, thereby generating a new system size, CSSs of frequently accessed
nonprimary replica. Updates are performed filegroups are bound to become a point of
only on the primary copy and cause all congestion, since they would need to satisfy
other replicas to be invalidated by sending a growing number of clients.
appropriate messages. Atomic and serial- The principle of bounded resources can
ized invalidation of all nonprimary replicas be applied to channels and network traffic,
is not guaranteed. Hence, it is possible that too, and hence prohibits the use of broad-
a stale replica is considered valid. Consis- casting. Broadcasting is an activity that
tency of replicas is sacrificed for a simple involves every machine in the network.
update protocol. To satisfy remote Write A mechanism that relies on broadcast-
accesses, the primary copy is migrated to ing is simply not realistic for large-scale
the requesting machine. systems.
The third example combines aspects of
6. Scalability Issues
scalability and fault tolerance. It was al-
ready mentioned that if a stateless service
Very large-scale DFSs, to a great extent, is used, a server need not detect a client’s
are still visionary. Andrew is the closest crash nor take any precautions because of
system to be classified as a very large-scale it. Obviously this is not the case with state-
system with a planned configuration of ful service, since the server must detect
thousands of workstations. There are no clients’ crashes and at least discard the
magic guidelines to ensure the scalability state it maintains for them. It is interesting
of a system. Examples of nonscalable de- to contrast the ways MOS and Locus
signs, however, are abundant. In Section reclaim obsolete state storage on servers
ACM Computing Surveys, Vol. 22, No. 4, December 1990
Distributed File Systems l 341

[Barak and Litman 1985; Barak and machines violates functional symmetry.
Paradise 19861. Autonomy and symmetry are, however, im-
The approach taken in MOS is garbage portant goals to which to aspire.
collection. It is the client’s responsibility to An important aspect of decentralization
set, and later reset, an expiration date on is system administration. Administrative
state information the servers maintain for responsibilities should be delegated to en-
it. Clients reset this date whenever they courage autonomy and symmetry, without
access the server or by special, infrequent disturbing the coherence and uniformity of
messages. If this date has expired, a the distributed system. Andrew and Apollo
periodic garbage collector reclaims that Domain support decentralized system man-
storage. This way, the server need not de- agement [Leach et al. 19851.
tect clients’ crashes. By contrast, Locus The practical approximation to symmet-
invokes a clean-up procedure whenever a ric and autonomous configuration is clus-
server machine determines that a particu- tering, where a system is partitioned into
lar client machine is unavailable. Among a collection of semiautonomous clusters.
other things, this procedure releases space A cluster consists of a set of machines
occupied by the state of clients from the and a dedicated cluster server. To make
crashed machine. Detecting crashes can be cross-cluster file references relatively infre-
very expensive, since it is based on polling quent, most of the time, each machine’s
and time-out mechanisms that incur sub- requests should be satisfied by its own clus-
stantial network overhead. The scheme ter server. Such a requirement depends on
MOS uses requires tolerable and scalable the ability to localize file references and the
overhead, where every client signals a appropriate placement of component units.
bounded number of objects (the object it If the cluster is well balanced, that is, the
owns), whereas a failure detection mecha- server in charge suffices to satisfy a major-
nism is not scalable since it depends on the ity of the cluster demands, it can be used
size of the system. as a modular building block to scale up the
Network congestion and latency are system. Observe that clustering complies
major obstacles to large-scale systems. A with the Bounded Resources Principle. In
guideline worth pursuing is to minimize essence, clustering attempts to associate a
cross-machine interactions by means of server with a fixed set of clients and a set
caching, hints, and enforcement of relaxed of files they access frequently, not just with
sharing semantics. There is, however, a an arbitrary set of files. Andrew’s use of
trade-off between the strictness of the shar- clusters, coupled with read-only replication
ing semantics in a DFS and the network of key files, is a good example for a scalable
and server loads (and hence necessarily the clustering scheme.
scalability potential). The more stringent UNIX United emphasizes the concept of
the semantics, the harder it is to scale the autonomy. There, UNIX systems are joined
system up. together in a recursive manner to create a
Central control schemes and central re- larger global system [Randell 19831. Each
sources should not be used to build scalable component system is a complex UNIX sys-
(and fault-tolerant) systems. Examples of tem that can operate and be administered
centralized entities are central authentica- independently. Again, modular and auton-
tion server, central naming server, and cen- omous components are combined to create
tral file server. Centralization is a form of a large-scale system. The emphasis on
functional asymmetry among the machines autonomy results in some negative effects,
comprising the system. The ideal alterna- however, since component boundaries are
tive is a configuration that is functionally visible to users.
symmetric; that is, all the component
machines have an equal role in the opera-
6.2 Lightweight Processes
tion of the system, and hence each machine
has some degree of autonomy. Practically, A major problem in the design of any serv-
it is impossible to comply with such a prin- ice is the process structure of the server.
ciple. For instance, incorporating diskless Servers are supposed to operate efficiently
ACM Computing Surveys, Vol. 22, No. 4, December 1990
342 l E. Levy and A. Silberschatz
in peak periods when hundreds of active late in a common queue and threads are
clients need to be served simultaneously. A assigned to requests from the queue. The
single server process is certainly not a good advantages of using an LWPs scheme to
choice, since whenever a request necessi- implement the service are twofold. First,
tates disk I/O the whole service is delayed an I/O request delays a single thread, not
until the I/O is completed. Assigning a the entire service. Second, sharing common
process for each client is a better choice; data structures (e.g., the requests queue)
however, the overhead of multiplexing the among the threads is easily facilitated.
CPU among the processes (i.e., the context It is clear that some form of LWPs
switches) is an expensive price that must scheme is essential for servers to be scal-
be paid. able. Locus, Sprite, Andrew, use such
A related problem has to do with the fact schemes; in the future NFS will too. De-
that all the server processes need to share tailed studies of threads implementations
information, such as file headers and serv- can be found in Kepecs 1985 and Tevanian
ice tables. In UNIX 4.2 BSD processes et al. 1987.
are not permitted to share address
spaces, hence sharing must be done exter- 7. UNIX UNITED
naliy by using files and other unnatural
mechanisms. The UNIX United project from the lJni-
It appears that one of the best solutions versity of Newcastle upon Tyne, England,
for the server architecture is the use of is one of the earliest attempts to extend the
Lightweight Processes (LWPs) or Threads. UNIX file system to a distributed one with-
A thread is a process that has very little out modifying the UNIX kernel. In UNIX
nonshared state. A group of peer threads United, a software subsystem is added
share code, address space, and operating to each of a set of interconnected UNIX
system resources. An individual thread has systems (referred to as component or con-
at least its own register state. The extensive stituent systems), so as to construct a dis-
sharing makes context switches among peer tributed system that is functionally
threads and threads’ creation inexpensive, indistinguishable from a conventional cen-
compared with context switches among tra- tralized UNIX system.
ditional, heavy-weight processes. Thus, Originally, the component systems were
blocking a thread and switching to another perceived as mainframes functioning as
thread is a reasonable solution to the prob- time-sharing UNIX systems, and indeed
lem of a server handling many requests. the original implementation was based
The abstraction presented by a group of on a set of PDP-11’s connected by a
LWPs is that of multiple threads of control Cambridge Ring.
associated with some shared resources. The system is presented in two levels of
There are many alternatives regarding detail: First, an overview of UNIX United
threads; we mention a few of them briefly. is given. Then the implementation, the
Threads can be supported above the kernel, Newcastle Connection layer, is described.
at the user level (as done in Andrew) or by
the kernel (as in Mach [Tevanian et al. 7.1 Overview
19871). Usually, a lightweight process is not
bound to a particular client. Instead, it Any number of inter-linked UNIX system
serves single requests of different clients. can be joined to compose a UNIX United
Scheduling threads can be preemptive or system. Their naming structures (for files,
nonpreemptive. If threads are allowed to devices, directories, and commands) are
run to completion, their shared data need joined together into a single naming struc-
not be explicitly protected. Otherwise, some ture, in which each component system is to
explicit locking mechanism must be used all intents and purposes just a directory.
to synchronize the accesses to the shared Ignoring for the moment questions regard-
data. ing accreditation and access control, the
Typically, when LWPs are used to im- resulting system is one in which each user
plement a service, client requests accumu- can read or write any file, use any device,
ACM Computing Surveys, Vol. 22, No. 4, December 1990
Distributed File Systems l 343
execute any command, or inspect any di- /f3, file fl is referred to as /../fl, file f2 is
rectory, regardless of the system to which referred to as /../../unix2/f2, and finally
it belongs. That is, network transparency file f4 is referred to as /../../unix2/dir/
is supported. unix4/f4.
The component unit is a complete UNIX Observe that users are aware of the up-
tree belonging to a certain machine. The ward boundaries of the current component
position of these component units in the unit since they must use the ‘I..’ syntax
naming hierarchy is arbitrary. They can whenever they wish to ascend outside of
appear in the naming structure in positions their current machine. Hence, UNIX
subservient to other component units (di- United fails to provide complete location
rectly or via intermediary directories). It is transparency.
often convenient to set the naming struc- The traditional root directories (e.g.,
ture to reflect organizational hierarchy /dev, /bin) are maintained for each ma-
of the environment in which the system chine separately. Because of the relative
exists. naming scheme, they are named, from
In conventional UNIX the root of a file within a component system, in the exact
hierarchy is its own parent and is the only way as in conventional UNIX (e.g., just
directory not assigned a string name. In /dev). Each component system has its own
UNIX United, each component’s root is set of named users and its own administra-
still referred to as ‘/’ and still serves as the tor (superuser). The latter is responsible
starting point of all pathnames starting for the accreditation for users of his or her
with a ‘1’. Roots of component units, how- own system as well as remote users. For
ever, are assigned names so that they uniqueness, remote users’ identifiers are
become accessible and distinguishable ex- prefixed with the name of their original
ternally. Also, a subservient component can system. Accesses are governed by the stan-
access its superior system by referring to dard UNIX file protection mechanisms,
its own root parent, (i.e., ‘/..‘). Therefore, even if they cross component boundaries.
there is only one root that is its own parent That is, there is no need for users to log in
and that is not assigned a string name; separately or provide passwords when they
namely, the root of the composite name access remote files if they are properly ac-
structure, which is just a virtual node credited. Accreditation for remote users
needed to make the whole structure a single must be arranged with the system admin-
tree. Under this conventions, there is no istrator separately.
notion of absolute pathname. Each path.. UNIX United is well suited for a diverse
name is relative to some context, either the internetwork topology, spanning LANs, as
current working directory or the current well as direct links and even WANS. The
component unit. logical name space needs to be properly
In Figure 2, the directories unixl, mapped onto routing information in such a
unix2, unix3, and unix4 are component complex internetwork. An important de-
units (i.e., complete UNIX hierarchies) be- sign principle is that the naming hierarchy
longing to machines by the same names. needs bear no relationship to the network
For instance, all the files descending from topology.
unix2, except files that descend from
unix4, are stored on the machine unix2. 7.2 Implementation-Newcastle Connection
The tree rooted at Unix4 descends from
the directory dir, which is an ordinary (lo- The Newcastle Connection is a (user-level)
cal) directory of unix2. To illustrate the software layer incorporated in each com-
relative pathnames, note that /../unix2/f2 ponent system. This layer separates be-
is the name of the file f2 on the system tween the UNIX kernel on one hand, and
unix2 from within the unixl system. applications, command programs and the
From the unix3 system, the same file is shell on the other hand. It intercepts all
referred to as /../..unix2jf2. Now, suppose system calls concerning files and filters out
the current root (‘/‘) is as shown by the those that have to be redirected to remote
arrow. Then file f3 can be referenced as systems. Also, the Connection layer accepts

ACM Computing Surveys, Vol. 22, No. 4, December 1990


344 l E. Levy and A. Silberschatz

f4
Figure 2. UNIX United hierarchy.

system calls that have been directed to it changing the overall structure a very ex-
from other systems. Remote layers manage pensive (and hence infrequent) event.
communication by the means of a RPC Some leaves of the partial structure stored
protocol. Figure 3 is a schematic view of locally correspond to remote roots of other
the software architecture just described. parts of the global file system. These leaves
Incorporating the Connection layer pre- are specially marked and contain addresses
serves both the same UNIX system call of the appropriate storage sites of the de-
interface and the UNIX kernel, in spite of scending file systems. Pathname traversals
the extensive remote activity carried out by have to be continued remotely when en-
the system. The penalty for preserving the countering such marked leaves and, in fact,
kernel intact is the fact that the service is can span more than two systems until the
implemented as user-level daemon pro- target file is located. Therefore, a strict
cesses (as opposed to a kernel implemen- client-server pair model is not preserved.
tation), which slow down remote operation. Once a name is resolved and the file is
Each Connection layer stores a partial opened, it is accessed using file descriptors.
skeleton of the overall naming structure. The Connection layer marks descriptors
Each system stores its own file system lo- that refer to remote files and keeps network
cally. In addition, each system maintains addresses and routing information for them
information on the fragments of the overall in a per-process table.
name structure that relate it to its neigh- The actual remote file accesses are car-
boring systems in the naming structure ried out by a set of file server processes on
(i.e., systems that can be reached via trav- the target system. Each client has its own
ersal of the naming tree without passing file server process with which it communi-
through another system). For instance, re- cates directly. The initial connection is es-
fer to Figure 2. System Unix2 is aware of tablished with the aid of a spawner process
the position of systems unixl, unix2, and that has a standard fixed name that makes
unix4 in the global tree. Figure 4 shows it callable from any external process. This
the relative positioning of the component spawner process performs the remote ac-
units of the global name space that system cess rights checks according to a machine-
unix2 knows about. user identification pair. It also converts this
The fragments maintained by different identification to a valid local name. For the
systems overlap and hence must remain sake of preserving UNIX semantics, once a
consistent, a requirement that makes user process forks, its file service process

ACM Computing Surveys, Vol. 22, No. 4, Dxember 1990


Distributed File Systems l 345

~~~~

Figure 3. Schematic view of the UNIX United architecture.

mix1

unix4

Figure 4. Partial skeleton UNIX2 has (see Figure 2).

forks as well. File descriptors (not lower l Connection Layer. Conceptually, the con-
level means such as i-nodes) are used to nection layer implementation is ele-
identify files between a user and its file gant and simple. It is a modular sub-
server. This is a stateful service scheme and system interfacing two existing layers
hence does not excel in terms of robustness. without modifying either of them or their
original semantics and still extending
7.3 Summary their capabilities by large. The imple-
mentation strategy is by relinking appli-
The overall profile of the UNIX United cation programs with the Connection
system can be characterized by the follow- layer library routines. These routines
ing prominent features: intercept file system calls and forward
l Logical Name Structure. The UNIX the remote ones to user-level remote
United name structure is a hierarchy daemons at the remote sites.
composed of component UNIX subtrees. Even though UNIX United is outdated, it
There is an explicitly visible correspon- serves our purposes well in demonstrating
dence between a machine and a subtree network transparency without location
in the structure; hence, machine bound- transparency, a simple implementation
aries are noticeable. Users must use the technique, and the issue of autonomy of
‘/..’ trap to get out of the current com- component systems.
ponent unit. There are no absolute path-
names-all pathnames are relative to
some context. 8. LOCUS
l Recursive Structure. Structuring a UNIX Locus is an ambitious project aimed at
United system out of a set of component building a full-scale distributed operating
systems is a recursive process akin to a system. The system is upward compatible
recursive definition of a tree. In theory, with UNIX, but unlike NFS, UNIX United,
such a system can be indefinitely exten- and other UNIX-based distributed sys-
sible. The building block of this recursive tems, the extensions are major ones and
scheme is an autonomous and complete necessitate a new kernel rather than a mod-
UNIX system. ified one. Locus stands out among other

ACM Computing Surveys. Vol. 22, No. 4, December 1990


346 l E. Levy and A. Silberschatz
systems by hosting a variety of sophisti- nect the network into two or more parti-
cated features such as automatic manage- tions cr disconnect,ed subnetworks. As long
ment of replicated data, atomic file update, as at least one copy of a file is available in
remote tasking, ability to withstand (to a a partition, read requests are served, and it
certain extent) failures and network parti- is still guaranteed that the version read is
tions, and full implementation of nested the most recent one available in the parti-
transactions [Weinstein et al. 19851. The tion. Automatic mechanisms take care
system has been operational at UCLA for to update stale copies of files upon the
several years on a set of mainframes and reconnection of partitions.
workstations connected by an Ethernet. Emphasizing high performance in the de-
A general strategy for extending Locus to sign of Locus led to incorporating network-
an internet environment is outlined in ing functions (such as formatting, queuing,
Sheltzer and Popek [1986]. transmitting, and retransmitting messages)
The heart of the Locus architecture is its into the operating system. Specialized re-
DFS. We first give an overview of the fea- mote operations protocols were devised for
tures and general implementation philoso- kernel-to-kernel communication, in con-
phy of the file system. Then we discuss the trast to the prevalent approach of using an
static nature of the file system (Sections off-the-shelf package (e.g., an RPC proto-
8.2) and its dynamics (Sections 8.3 and 8.4). col). Lack of multilayering (as suggested in
We devote section 8.5 of the operation of the IS0 standard [Zimmermann 19801) en-
the system in a faulty environment. abled achieving high performance for
remote operations. On the other hand,
this snecialized protocol hampers Locus
8.1 Overview
portability to different networks and
The Locus file system presents a single tree file systems.
structure naming hierarchy to users and An efficient but limited kernel-supported
applications. This structure covers all ob- LWP facility is devised for serving remote
jects (files, directories, executable files, and requests. These are processes that have no
devices) of all the machines in the system. nonprivileged address space, their code and
Locus names are fully location transparent; stack are resident in the operating system
from a name of an object it is not possible nucleus, and they can call internal system
to discern its location in the network. To a routines directly. These processes are as-
first approximation, there is almost no way signed to serve net.worb requests that ac-
to distinguish the Locus name structure cumulate in a system queue. The system is
from a standard UNIX tree. configured with some number of these pro-
A Locus file may correspond to a set of cesses, but that number is automatically
copies distributed on different sizes. An and dynamically altered during system
additional transparency dimension is intro- operation.
duced since it is the system responsibility
to keep all copies up to date and assure that 8.2 Name Structure
access requests are served by the most re-
cently available version. As an option, users The logical name structure disguises both
may have control over both the number location and replication details from users
and location of replicated files. In Locus, and applications. A removable file system,
file replication serves mainly to increase in Locus terms, is called a filegroup. A
availability for reading purposes. A primary filegroup is the component unit in Locus.
copy approach is adopted for modifications. Virtually, logical filegroups are joined to-
Locus strives to provide UNIX semantics gether to form this unified structure. Phys-
in the distributed and replicated environ- ically, a logical filegroup is mapped to
ment in which it operates. Additionally, it multiple physical containers (called also
supports locking of files and atomic update. packs) residing at various sites and storing
Fault tolerance issues are emphasized in replicas of the files of that filegroup. The
Locus design. Network failure may discon- pair <logical filegroup number, i-node

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 347
number>, referred to as a file designator these traditional file names to hardware-
serves as low-level, location-independent and site-specific files.
file identifier. The designator itself hides
both location and replication details, since
it points to a file in general and not to a 8.3 File Operations
particular replica. In contrast to the prevalent model of a
Each site has a consistent and complete server-client pair involved in a file access,
view of the logical name structure. A logical Locus distinguishes three logical roles in
mount table is globally replicated and con- file accesses; each one potentially per-
tains an entry for each logical filegroup. formed by a different site:
The entry records the file designator of the
directory over which the filegroup is logi- Using Site (US) issues the requests to
cally mounted and an indication of which open and access a remote file.
site is currently responsible for access syn-
Storage Site (SS) is the selected site to
chronization (the function of this site is
serve the requests.
explained subsequently) within the file-
group. A protocol, implemented within the Current Synchronization Site (CSS) en-
mount and unmount Locus system calls, forces a global synchronization policy for
performs update of the logical mount tables a filegroup and selects an SS for each
on all sites when necessary. Open request referring to a file in the
On the physical level, physical containers filegroup. There is at most one CSS for
correspond to disk partitions. One of the each filegroup in any set of communicat-
packs is designated as the Primary Copy. A ing sites (i.e., partition). The CSS main-
file must be stored at the site of the primary tains the version number and a list of
copy and, in addition, can be stored at any physical containers for every file in the
subset of the other sites where there exists filegroup.
a pack corresponding to its filegroup. Thus,
the primary copy stores the filegroup com- The following sections describe the file op-
pletely, whereas the rest of the packs might erations as they are carried out by the above
be partial. Replication is especially useful entities. Related synchronization issues are
for directories in the high levels of the name described in Section 8.3.4.
hierarchy. Such directories are rarely up-
dated and are crucial for pathnames trans- 8.3.1 Opening and Reading a File
lation of files.
The various copies of a file are assigned We first describe how a file is opened and
to the same i-node number on all the file- read given its designator and then describe
group’s packs. Consequently, a pack has an how a designator is obtained from a string
empty i-node slot for all files it does not pathname.
store. Data page numbers may be different Given a file designator, opening the file
on different packs, hence reference over the commences as follows. The US determines
network to data pages use logical page num- the relevant CSS by looking up the file-
bers rather than physical ones. Each pack group in the logical mount table, then for-
has a mapping of these logical numbers to wards the Open request to the CSS. The
physical numbers. To facilitate automatic CSS polls potential SSs for that file to
replication management, each i-node of a decide which one will act as the real SS. In
file copy contains a version number, deter- its polling messages: the CSS includes the
mining which copy dominates other copies. version number for the particular file so the
Whereas globally unique file naming is potential SSs can, by comparing this num-
very important most of the time, certain ber to their own, decide whether or not
files and directories are hardware and site their copy is up to date. The CSS selects
specific (e.g., /bin is hardware-specific, and an SS by considering the response it got
/dev is site-specific). Locus provides trans- back from the candidate sites and sends the
parent means for translating references to selected SS identity to the US. Both the

ACM Computing Surveys, Vol. 22, No. 4, December 1990


348 l E. Levy and A. Silberschatz
CSS and the SS allocate in-core i-node 8.3.2 Modifying a File
structures for the opened file. The CSS In Locus, a primary copy policy is used for
needs this information to make future syn- file modifications. The CSS must select the
chronization decisions, and the SS main- primary copy pack site as the SS if the
tains the i-node to serve forthcoming Open is for a Write. The act of modifying
accesses efficiently. data takes on two forms. If the modification
After a file is open, a Read request is sent does not include the entire page, the old
directly to the SS without the CSS inter- page is first read from the SS using the
vention. A Read request contains the des- Read protocol. If an entire page is modified,
ignator of the file, the logical number of the a buffer is set up at the US without any
needed page within that file, and a hint as reads. In either case, after changes are
to where the SS might store the file’s made, possibly by delayed-write, the page
i-node in main memory. Once the i-node is is sent back to the SS. All modified pages
found, the SS translates the logical page must be flushed to the SS before a modified
number to physical number, and a standard file can be closed.
low-level routine is called to allocate a If a file is closed by the last user process
buffer and get the appropriate page from at a US, the SS and CSS must be informed
disk. The buffer is queued on the network so that they can deallocate in-core i-node
queue for transmission back to the US as a structures and the CSS can alter state data
response, where it is stored in a kernel that might affect its next synchronization
buffer. Once a page is fetched to the US, decision.
further Read calls are serviced from the Caching of data pages is relied upon
kernel buffer. As in the case of local disk heavily in both Read and Write operations.
Reads, read-ahead is useful to speed up The validation of the cached data is dealt
sequential reading, both at the US and the with in Section 8.3.4.
SS. If a process loses its connection with a
file it is reading remotely, the system at- 8.3.3 Commit and Abort
tempts to reopen a different copy of the
same version of the file. Locus uses the following shadow page
Translating a pathname into a file des- mechanism for implementing atomic com-
ignator is carried out by seemingly conven- mit. When a file is modified, disk pages are
tional pathname traversal mechanism since allocated at the SS; these pages are the
pathnames are regular UNIX pathnames, shadow pages. The in-core copy of the disk
with no exception (unlike UNIX United). i-node is updated to point to the shadow
Every lookup of a component of the path- pages. The disk i-node is kept intact, point-
name within a directory involves opening ing to the original pages. To abort a set of
the latter and reading from it. These oper- changes, both the in-core i-node informa-
ations are conducted according to the above tion and the shadow pages used to record
protocols (i.e., directory entries are also the changes are discarded. The atomic
cached in US buffers). There is no parallel Commit operation consists of moving the
to NFS’s remote lookup operation. The ac- incore i-node to the disk i-node. After that,
tual directory searching is performed by the the file contains the new information. The
client rather than by the server. A directory US function never deals with actual disk
opened for pathname searching is not open pages, but rather with logical pages. Thus,
for normal Read, but instead for an internal the entire shadow page mechanism is im-
unsynchronized Read. The distinction is plemented at the SS and is transparent to
that no global synchronization is needed the US.
and no locking is done while the reading is Locus deals with file modification by first
performed; that is, updates to the directory committing the change to the primary copy.
can occur while the search is ongoing. Later, messages are sent to all other SSs
When the directory is local, even the CSS and to the CSS. At a minimum, these mes-
is not informed of such access. sages identify the modified file and contain

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 349

the new version number (in order to pre- managers operating at the corresponding
vent attempts to read the old version). It is storage sites.
the responsibility of these additional SSs The cached data pages are guaranteed to
to bring their version up to date by propa- contain valid data only when the files’s data
gating the entire file or just the changes. A token is present. When the write data token
queue of propagation requests is kept is taken from that site, the i-node, as well
within the kernel at each site, and a kernel as all modified pages, is copied back to the
process services the queue efficiently by SS. Since arbitrary changes (initiated by
issuing appropriate Read requests. This remote clients) may have occurred when
propagation procedure uses the standard the token was not present, all cached
commit mechanism. Thus, if contact with buffers are invalidated when the token is
the file containing the newer version is lost, released. When a data token is granted to
the local file is left with a coherent copy, a site, both the i-node and data pages need
albeit still out of date. Given this commit to be fetched from the SS. There are some
mechanism, one is always left with either exceptions to enforcing this policy. Some
the original file or a completely changed attribute reading and writing calls (e.g.,
file, but never with a partially made change, stat) as well as directory reading and mod-
even in the face of site failures. ifying (e.g., lookup) calls are not subject to
the synchronization constraints. These
calls are sent directly to the SS, where the
8.4 Synchronizing Accesses to Files
changes are made, committed, and propa-
The default synchronization policy in Lo- gated to all storage and using sites.
cus is to emulate UNIX semantics on file Alternatively to the default UNIX se-
accesses in a distributed environment. mantics, Locus offers facilities for locking
UNIX semantics can be implemented fairly entire files or parts of them. Locking can
easily by having the processes share the be advisory (only checked as a result of a
same operating system data structures and locking attempt) or enforced (checked on
caches and by using locks on data struc- all reads and writes). A process can choose
tures to serialize requests. In Locus the to either fail if it cannot immediately get a
sharing processes may not co-reside on the lock or wait for it to be released.
same machine, and hence the implementa-
tion is more complicated. 8.5 Operation in a Faulty Environment
Recall that UNIX semantics allow sev-
eral processes descending from the same The basic approach in Locus is to maintain,
ancestor process to share the same current within a single partition, consistency
position (offset) in a file. A single token among copies of a file. The policy is to
scheme is devised to preserve this special allow updates only in a partition that has
mode of sharing. Only when the token is the primary copy. It is guaranteed that the
present, can a site proceed with executing most recent version of a file in a partition
system calls needing the offset. is read. The latter guarantee applies to all
In UNIX, the same in-core i-node for a partitions.
file can be shared by several processes. In A central point addressed in this section
Locus, the situation is much more compli- is the reconciliation of replicated filegroups
cated since the i-node of the file, as well as residing at partitioned sites. During normal
the parts of the file itself, can be cached at operation, the commit protocol ascertains
several sites. Token schemes are used to proper propagation of updates as described
synchronize sharing of a file’s i-node and earlier. A more elaborate scheme has to be
data. An exclusive-writer-multiple-readers used by recovering sites wishing to bring
policy is enforced. Only a site with a write their packs up to date. To this end, the
token for a file may modify the file; system maintains a commit count for each
any site with a read token can read it. The filegroup, enumerating each commit of
token schemes are coordinated by token every file in the filegroup. Each pack has a

ACM Computing Surveys, Vol. 22, No. 4, December 1990


350 l E. Levy and A. Silberschatz

lower-water-mark (lwm) that is a commit and for all local resources being used by
count value up to which the system guar- processes local to site b. This substantial
antees that all prior commits are reflects in cleaning procedure is the penalty of the
the pack. Also, the primary copy pack (usu- state information kept by a!1 three sites
ally stored at the CSS) keeps a list enu- participating in file access.
merating the files in the filegroup and the Since directory updates are not restricted
corresponding commit counts of all the re- to be applied to the primary copy, conflicts
cent commits in secondary storage. When among updates in different partitions may
a pack joins a partition it attempts to con- arise [Walker et al. 19831. Because of the
tact the CSS and checks whether its lwm simple nature of directory modification,
is within the recent commit list range. If however, an automatic reconciliation pro-
this is the case, the pack site schedules a cedure is devised. This procedure is based
kernel process that brings the pack to a on comparing the i-nodes and string name
consistent state by copying only the files pairs of replicas of the same directory. The
that reflect commits later than that of the most extreme action taken is when the
site’s lwm. If the CSS is not available, same name string corresponds to two dif-
writing is disallowed in this partition, but ferent i-nodes (i.e., the same name is used
reading is possible after a new CSS is cho- for creating two different files) and
sen. The new CSS communicates with the amounts to altering the file names slightly
partition members so it will be informed of and notifying the files owners by electronic
the most recent available (in the partition) mail.
version of each file in the filegroup. Once
the new CSS accomplishes the objective,
8.8 Summary
other pack sites can reconcile themselves
with it. As a result, all communicating sites An overall profile and evaluation of Locus
see the same view of the filegroup, and this is summarized by pointing out the following
view is as complete as possible, given a issues:
particular partition. Note that since up-
dates are allowed within the partition with Distributed operating system. Because of
the primary copy and Reads are allowed in the multiple dimensions of transparency
the rest of the partitions, it is possible to in Locus, it comes close to the definition
Read out-of-date replicas of a file. Thus, of a truly distributed operating system
Locus sacrifices consistency for the ability in contrast to a collection of network
to continue to both update and read files in services [Tanenbaum and Van Renesse
a partitioned environment. 19851.
When a pack is too far out of date (i.e., Implementation strategy. Essent,ially,
its lwm indicates a prior value to the earli- kernel augmentation is the implementa-
est commit count value in the primary tion strategy in Locus. The common
copy commit list), the system invokes an pattern in Locus is kernel-to-kernel
application-level process to bring the file- communication via specialized, high-
group up to date. At this point, the system performance protocols. This strategy is
lacks sufficient knowledge of the most re- needed to support the philosophy of a
cent commits to identify the missing up- distributed operating system.
dates. Instead, the site must inspect the Replication. A primary copy replication
entire i-node space to determine which files scheme is used in Locus. The main merit
in its pack are out of date. of this kind of replication scheme is in-
When a site is lost from an operational creased availability of directories that ex-
Locus network, a clean-up procedure is nec- hibit high read-write ratio. Availability
essary. Essentially, once site a has decided for modifying files is not increased by the
that site b is unavailable, site a must invoke primary copy approach. Handling repli-
failure handling for remote resources that cation transparently is one of the reasons
processes local to a were using at site b, for introducing the CSS entity, which is

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 351

a third entity taking part in a remote l A logical mount table replicated at all
access. In this context, the CSS functions sites is clearly not a scalable mecha-
as the mapping from an abstract file to a nism.
physical replica. l Extensive message traffic and server
l Access synchronization. UNIX seman- load caused by the complex synchro-
tics are emulated to the last detail, in nization of accesses needed to provide
spite of caching at multiple USs. Alter- UNIX semantics.
natively, locking facilities are provided. l UNIX compatibility. The way Locus
l Fault tolerance. Substantial effort has handles remote operation is geared to
been devoted to designing mechanisms emulation of standard UNIX. The im-
for fault tolerance. A few are an atomic plementation is merely an extension of
update facility, merging replicated packs UNIX implementation across a net-
after recovery, and a degree of indepen- work. Whenever buffering is used in
dent operation of partitions. The effects UNIX, it is used in Locus as well.
can be characterized as follows: UNIX compatibility is indeed retained;
Within a partition, the most recent, however, this approach has some in-
available version of a file is read. The herent flaws. First, it is not clear
primary copy must be available for whether UNIX semantics are appro-
write operations. priate. For instance, the mechanism for
supporting shared file offset by remote
The primary copy of a file is always up
processes is complex and expensive. It
to date with the most recent committed
is unclear whether this peculiar mode
version. Other copies may have either
of sharing justifies this price. Second,
the same version or an older version,
using caching and buffering as done in
but never a partially modified one.
UNIX in a distributed system has some
A CSS function introduces an addi- ramifications on the robustness and re-
tional point of failure. For a file to be coverability of the system. Compatibil-
available for opening, both the CSS ity with UNIX is indeed an important
for the filegroup and an SS must be design goal, but sometimes it obscures
available. the development of an advanced dis-
Every pathname component must be tributed and robust system.
available for the corresponding file to
be available for opening.
9. SUN NETWORK FILE SYSTEM
A basic questionable decision regarding
fault tolerance is the extensive use of in- The Network File System (NFS) is a name
core information by the CSS and SS func- for both an implementation and a specifi-
tions. Supporting the synchronization pol- cation of a software system for accessing
icy is a partial cause for maintaining this remote files across LANs. The implemen-
information; however, the price paid during tation is part of the SunOS operating sys-
recovery is enormous. Besides, explicit tem, which is a flavor of UNIX running on
deallocation is needed to reclaim this in- Sun workstations using an unreliable da-
core space, resulting in a pure overhead of tagram protocol (UDP/IP protocol [Postel
message traffic. 19801) and Ethernet. The specification and
implementation are intertwined in the fol-
l Scalability. Locus does not lend itself to lowing description; whenever a level of de-
very large distributed system environ- tail is needed we refer to the SunOS
ment, mainly because of the following implementation, and whenever the descrip-
reasons: tion is general enough it also applies to the
l One CSS per file group can easily be- specification.
come a bottleneck for heavily accessed The system is presented in three levels
filegroups. of detail. First (in Section 9.1), an overview

ACM Computing Surveys, Vol. 22, No. 4, December 1990


352 . E. Levy and A. Silberschatz
is given. Then, two service protocols that the latest NFS version, diskless worksta-
are the building blocks for the implemen- tions can even mount their own roots from
tation are examined (Section 9.2). Finally servers (Version 4.0, May 1988 described in
(in Section 9.3), a description of the SunOS Sun Microsystems Inc. [ 19881). In previous
implementation is given. NFS versions, a diskless workstation de-
pends on the Network Disk (ND) protocol
9.1 Overview
that provides raw block I/O service from
remote disks; the server disk was parti-
NFS views a set of interconnected worksta- tioned and no sharing of root file systems
tions as a set of independent machines with was allowed.
independent file systems. The goal is to One of the design goals of NFS is to
allow some degree of sharing among these provide file services in a heterogeneous en-
file systems in a transparent manner. Shar- vironment of different machines, operating
ing is based on server-client relationship. A systems, and network architecture. The
machine may be, and often is, both a client NFS specification is independent of these
and a server. Sharing is allowed between media and thus encourages other imple-
any pair of machines, not only with dedi- mentations. This independence is achieved
cated server machines. Consistent with the through the use of RPC primitives built on
independence of a machine is the critical top of an External Date Representation
observation that NFS sharing of a remote (XDR) protocol-two implementation-
file system affects only the client machine independent interfaces [Sun Microsystems
and no other machine. Therefore, there is Inc. 19881. Hence, if the system consists of
no notion of a globally shared file system heterogeneous machines and file systems
as in Locus, Sprite, UNIX United, and that are properly interfaced to NFS, file
Andrew. systems of different types can be mounted
To make a remote directory accessible in both locally and remotely.
a transparent manner from a client ma-
chine, a user of that machine first has to
carry out a mount operation. Actually, only 9.2 NFS Services
a superuser can invoke the mount opera-
The NFS specification distinguishes be-
tion. Specifying the remote directory as an
tween the services provided by a mount
argument for the mount operation is done
mechanism and the actual remote file ac-
in a nontransparent manner; the location
cess services. Accordingly, two separate
(i.e., hostname) of the remote directory has
protocols are specified for these services-
to be provided. From then on, users on the
client machine can access files in the re- a mount protocol and a protocol for remote
file accesses called the NFS protocol.
mote directory in a totally transparent
The protocols are specified as sets of
manner, as if the directory were local. Since
RPCs that define the protocols’ function-
each machine is free to configure its own
ality. These RPCs are the building blocks
name space, it is not guaranteed that all
used to implement transparent remote file
machines have a common view of the
access.
shared space. The convention is to con-
figure the system to have a uniform name
space. By mounting a shared file system 9.2.1 Mount Protocol
over user home directories on all the ma-
chines, a user can log in to any workstation We first illustrate the semantics of mount-
and get his or her home environment. Thus, ing by a series of examples. In Figure 5a,
user mobility can be provided, although the independent file systems belonging to
again by convention. the machines named client, serverl, and
Subject to access rights accreditation, po- server2 are shown. At this stage, at each
tentially any file system or a directory machine only the local files can be accessed.
within a file system can be remotely The triangles in the figure represent sub-
mounted on top of any local directory. In trees of directories of interest in this

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 353
Client: Serverl: Server2:

\ “SI‘

\
shared ,.:..
: ‘..,
.:’ ..,
,:’ ..,
,:’ ‘4,
,:’ dir1 ‘...
,:’ ‘.._
: . . . . . . .. . . . . . . .._............ 1..
(a)

Client: Client:

“Sr

,:’ ‘9..
,;’
,:’ dir1 ““..
,:’
:. ::..
(b)

Figure 5. NFS joins independent file systems (a), by mounts (b), and cascading mounts (c).

example. In Figure 5b, the effects of the mount mechanism does not exhibit a
the mounting server l:/usr/shared over transitivity property. In Figure 5c we illus-
client:/usr/local are shown. This figure trate cascading mounts by continuing our
depicts the view users on client have of example. The figure shows the result of
their file system. Observe that any file mounting server2:/dir2/dir over client:/
within the dir1 directory, for instance, can usr/local/dir 1, which is already remotely
be accessed using the prefix /usr/local/ mounted from serverl. Files within dir3
dir1 in client after the mount is complete. can be accessed in client using the prefix
The original directory /usr/local on that /usr/local/dir 1.
machine is not visible any more. The mount protocol is used to establish
Cascading mounts are also permitted. the initial connection between a server and
That is, a file system can be mounted over a client. The server maintains an export
another file system that is not a local one, list (the /etc/exports in UNIX) that spec-
but rather a remotely mounted one. A ma- ifies the local file systems it exports for
chine’s name space, however, is affected mounting, along with names of machines
only by those mounts the machine’s own permitted to mount them. Any directory
superuser has invoked. By mounting a re- within an exported file system can be re-
mote file system, access is not gained for motely mounted by an accredited machine.
other file systems that were, by chance, Hence, a component unit is such a direc-
mounted over the former file system. Thus, tory. When the server receives a mount

ACM Computing Surveys, Vol. 22, No. 4, December 1990


354 l E. Levy and A. Silberschatz

request that conforms to its export list, it server crash. Consequently, this list might
returns to the client a file handle that is include inconsistent data and should be
the key for further accesses to files within treated only as a hint.
the mounted file system. The file handle A further implication of the stateless
contains all the information the server server philosophy and a result of the syn-
needs to distinguish individual files it chrony of an RPC is that modified data
stores. In UNIX terms, the file handle con- (including indirection and status blocks)
sists of a file system identifier and an i- must be committed to the server’s disk
node number to identify the exact mounted before the call returns results to the client.
directory within the exported file system. The NFS protocol does not provide concur-
The server also maintains a list of the rency control mechanisms. The claim is
client machines and the corresponding cur- that since locks management is inherently
rently mounted directories. This list is stateful, a service outside the NFS should
mainly for administrative purposes, such as provide locking. It is advised that users
for notifying all clients that the server is would coordinate access to shared files us-
going down. Adding and deleting an entry ing mechanisms outside the scope of NFS
in this list is the only way the server state (e.g., by means provided in a database man-
is affected by the mount protocol. agement system).
Usually a system has some static mount-
ing preconfiguration that is established at 9.3 Implementation
boot time; however, this layout can be mod-
ified (/etc/fstab in UNIX). In general, Sun’s implementation of NFS
is integrated with the SunOS kernel for
reasons of efficiency (although such inte-
9.2.2 NFS Protocol gration is not strictly necessary). In this
section we outline this implementation.
The NFS protocol provides a set of remote
procedure calls for remote file operations.
The procedures support the following 9.3.1 Architecture
operations: The NFS architecture is schematically de-
Searching for a file within a directory picted in Figure 6. The user interface is the
(i.e., lookup). UNIX system calls interface based on the
Open, Read, Write, Close calls, and file
Reading a set of directory entries. descriptors. This interface is on top of a
Manipulating links and directories. middle layer called the Virtual File System
Accessing file attributes. (VFS) layer. The bottom layer is the one
Reading and writing files. that implements the NFS protocol and is
called the NFS layer. These layers comprise
These procedures can be invoked only after the NFS software architecture. The figure
having a file handle for the remotely also shows the RPC/XDR software layer,
mounted directory. Recall that the mount local file systems, and the network and thus
operation supplies this file handle. can serve to illustrate the integration of a
The omission of Open and Close opera- DFS with all these components. The VFS
tions is intentional. A prominent feature of serves two important functions:
NFS servers is that they are stateless.
There are no parallels to UNIX’s open files l It separates file system generic opera-
table or file structures on the server side. tions from their implementation by
Maintaining the clients list mentioned in defining a clean interface. Several imple-
Section 9.2.1 seems to violate the stateless- mentations for the VFS interface may
ness of the server. The client list, however, coexist on the same machine, allowing
is not essential in any manner for the cor- transparent access to a variety of types
rect operation of the client or the server of file systems mounted locally (e.g., 4.2
and hence need not be restored after a BSD or MS-DOS).

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 355
Client Server

VFS intelface ( VFS iryfncc 1

RPC/XDR RPUXDR

0 disk
,” Network
7 I
Figure 6. Schematic view of the NFS architecture.

l The VFS is based on a file representation operation by a regular system call. The
structure called a unode, which contains operating system layer maps this call to a
a numerical designator for a file that is VFS operation on the appropriate vnode.
networkwide unique. (Recall that UNIX- The VFS layer identifies the file as a remote
i-nodes are unique only within a single one and invokes the appropriate NFS pro-
file system.) The kernel maintains one cedure. An RPC call is made to the NFS
vnode structure for each active node (file service layer at the remote server. This call
or directory). Essentially, for every file is reinjected into the VFS layer, which finds
the vnode structures complemented by that it is local and invokes the appropriate
the mount table provide a pointer to its file system operation. This path is retraced
parent file system, as well as to the file to return the result. An advantage of this
system over which it is mounted. architecture is that the client and the server
are identical; thus, it is possible for a ma-
Thus, the VFS distinguishes local files chine to be a client, or a server, or both.
from remote ones, and local files are further The actual service on each server is per-
distinguished according to their file system formed by several kernel processes, which
types. The VFS activates file system spe- provide a temporary substitute to a LWP
cific operations to handle local requests facility.
according to their file system types and
calls the NFS protocol procedures for re- 9.3.2 Pathname Translation
mote requests. File handles are constructed
from the relevant vnodes and passed as Pathname translation is done by breaking
arguments to these procedures. the path into component names and doing
As an illustration of the architecture, let a separate NFS lookup call for every pair
us trace how an operation on an already of component name and directory vnode.
open remote file is handled (follow the ex- Thus, lookups are performed remotely by
ample in Figure 6). The client initiates the the server. Once a mount point is crossed,

ACM Computing Surveys, Vol. 22, No. 4, December 1990


356 . E. Levy and A. Silberschatz
every component lookup causes a separate a remote operation and an RPC. Instead,
RPC to the server. This expensive path- file blocks and file attributes are fetched by
name traversal scheme is needed, since the RPCs and cached locally. Future re-
each client has a unique layout of its logical mote operations use the cached data subject
name space, dictated by the mounts if per- to some consistency constraints.
formed. It would have been much more There are two caches: file blocks cache
efficient to pass a pathname to a server and and file attribute (i-node information)
receive a target vnode once a mount point cache. On a file open, the kernel checks
was encountered. But at any point there with the remote server about whether to
can be another mount point for the partic- fetch or revalidate the cached attributes by
ular client of which the stateless server is comparing time stamps of the last modifi-
unaware. cation. The cached file blocks are used only
To make lookup faster, a directory name if the corresponding cached attributes are
lookup cache at the client holds the vnodes up to date. The attribute cache is updated
for remote directory names. This cache whenever new attributes arrive from the
speeds up references to files with the same server after a cache miss. Cached attributes
initial pathname. The directory cache is are discarded typically after 3 s for files or
discarded when attributes returned from 30 s for directories. Both read-ahead and
the server do not match the attributes of delayed-write techniques are used between
the cached vnode. the server and the client [Sun Microsys-
Recall that mounting a remote file sys- tems Inc. 881. (Earlier version of NFS used
tem on top of another already mounted write-on-close [Sandberg et al. 19851). The
remote file system (cascading mount) is caching unit is fairly large (8Kb) for per-
allowed in NFS. A server cannot, however, formance reasons. Clients do not free
act as an intermediary between a client and delayed-write blocks until the server con-
another server. Instead, a client must es- firms the data are written to disk. In con-
tablish a direct server-client connection trast to Sprite, delayed-write is retained
with the second server by mounting the even when a file is open concurrently in
desired server directory. Therefore, when a conflicting modes. Hence, UNIX semantics
client does a lookup on a directory on which are not preserved.
the server has mounted a file system, the Tuning the system for performance
client sees the underlying directory instead makes it difficult to characterize the shar-
of the mounted directory. When a client ing semantics of NFS. New files created on
has a cascading mount, more than one a machine may not be visible elsewhere for
server can be involved in a pathname trav- 30 s. It is indeterminate whether writes to
ersal. Each component lookup is, however, a file at one site are visible to other sites
performed between the original client and that have the file open for reading. New
some server. opens of that file observe only the changes
that have already been flushed to the
server. Thus, NFS fails to provide either
9.3.3 Caching and Consistency strict emulation of UNIX semantics or any
With the exception of opening and closing other clear semantics.
files, there is almost a one-to-one corre-
spondence between the regular UNIX sys- 9.4 Summary
tem calls for file operations and the NFS
protocol RPCs. Thus, a remote file opera- l Logical name structure. A fundamental
tion can be translated directly to the cor- observation is that every machine estab-
responding RPC. Conceptually, NFS lishes its own view of the logical name
adheres to the remote service paradigm, but structure. There is no notion of global
in practice buffering and caching tech- name hierarchy. Each machine has its
niques are used for the sake of performance. own root serving as a private and absolute
There is no direct correspondence between point of reference for its own view of the

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 357

name structure. Selective mounting of cannot be characterized clearly, since


parts of file systems upon explicit request they are timing dependent.
allows each machine to obtain its unique Finally, it should be realized that NFS is
view of the global file system. As a result, commercially available, has very reason-
users enjoy some degree of independence, able performance, and is perceived as a
flexibility, and privacy. It seems that de facto standard in the user community.
the penalty paid for this flexibility is
administrative complexity.
Network service versus distributed op- 10. SPRITE
erating system. NFS is a network service Sprite is an experimental, distributed op-
for sharing files rather than an integral erating system under development at the
component of a distributed operating sys- University of California at Berkeley. It
tem [Tanenbaum and Van Renesse is part of the Spur project, whose goal
19851. This characterization does not is the design and construction of high-
contradict the SunOS kernel implemen- performance multiprocessor workstation
tation of NFS, since the kernel integra- [Hill et al. 19861. A preliminary version of
tion is only for performance reasons. Sprite is currently operational on intercon-
Being a network service has two main nected Sun workstations.
implications. First, remote-file sharing is Section 10.1 gives an overview of the file
not the default; the service initiating re- system and related aspects. Section 10.2
mote sharing (i.e., mounting) has to be elaborates on the file lookup mechanism
explicitly invoked. Moreover, the first (called prefix tables) and Section 10.3 on
step in accessing a remote file, the mount t.he caching methods used in the file system.
call, is a location dependent one. Second,
perceiving NFS as a service and not as 10.1 Overview
part of the operating system allows its
design specification to be implementa- Sprite designers envision the next genera-
tion independent. tion of workstations as powerful machines
with vast main memory. Currently, work-
Remote service. Once a file can be ac-
stations have 4 to 32Mb of main memory.
cessed transparently I/O operations are
Sprite designers predict that memories of
performed according to the remote serv-
100 to 500Mb will be commonplace in a few
ice method: The data in the file are not
years. Their claim is that by caching files
fetched en masse; instead, the remote site
from dedicated servers, the large physical
potentially participates in each Read and
memories can compensate for lack of local
Write operation. NFS uses caching to
disks in clients’ workstations.
improve performance, but the remote site
The interface that Sprite provides in gen-
is conceptually involved in every I/O op-
eral and to the file system in particular is
eration.
much like the one provided by UNIX. The
Fault tolerance. A novel feature of NFS file system appears as a single UNIX tree
is the stateless approach taken in the encompassing all files and devices in the
design of the servers. The result is resil- network, making them equally and trans-
iency to client, server, or network fail- parently accessible from every workstation.
ures. Should a client fail, it is not As with Locus, the location transparency is
necessary for the server to take any ac- complete; there is no way to discern a file’s
tion. Once caching was introduced, var- network location from its name. Sprite en-
ious patches had to be invented to keep forces UNIX semantics for share files.
the cached data consistent without mak- In spite of its functional similarity to
ing the server stateful. UNIX, the Sprite kernel was developed
Sharing semantics. NFS does not pro- from scratch. Oriented toward multipro-
vide UNIX semantics for concurrently cessing, the kernel is multithreaded. Syn-
open files. In fact, the current semantics chronization between the multiple threads

ACM Computing Surveys, Vol. 22, No. 4, December 1990


358 l E. Levy and A. Silberschatz
is based on monitorlike structures with are used during name lookups, then de-
many small locks protecting the shared scribe how the tables change dynamically.
data [Hoare 19741. Network integrat,ion is Each entry in a prefix table corresponds
based on a simple kernel-to-kernel RPC to one of the domains. It contains the path-
facility implemented on top of a special- name of the topmost directory in the do-
purpose network protocol. The technique main (that pathname is called the prefix for
used in the protocol is implicit acknowledg- the domain), the network address of the
ment, originally discussed in Birrel and server storing the domain, and a numeric
Nelson [ 19841. designator identifying the domain’s root
A unique feature of the Sprite file system directory for the storing server. This des-
is its interplay with the virtual memory ignator is an index into the server table of
system. Most versions of UNIX use a spe- open files; it saves repeating expensive
cial disk partition as a swapping area for name translation.
virtual memory purposes. In contrast,, Every lookup operation for an absolute
Sprite uses ordinary files (called backing pathname starts with the client searching
files) to store data and stacks of running its prefix table for the longest prefix match-
processes. The motivation for this design is ing the given file name. The client strips
that it simplifies process migration and en- the matching prefix from the file name and
ables flexibility and sharing of the space sends the remainder of the name to the
allocated for swapping. Backing files are selected server along with the designat,or
cached in the main memories of servers, from the prefix table entry. The server uses
just like any other file. It is claimed that this designator to locate the root directory
clients would be able to read random pages of the domain, then proceeds by usual
from a server’s (physical) cache faster than UNIX pathname translation for the re-
from a local disk, which means that a server mainder of the file name. If the server
with a large cache may provide better pag- succeeds in completing the translation, it
ing performance t.han from a local disk. The replies with a designator for the open file.
virtual memory and file system share the There are several cases in which the
same cache, which is dynamically parti- server does not complete the lookup. For
tioned according to their conflicting needs. instance, a pathname can descend down
Sprite allows the file cache on each ma- into a new domain. This can happen when
chine to grow and shrink in response to an entry for a domain is absent from the
changing demands of the machine’s virtual table and, as a result, the prefix of the
memory and file system. Among other fea- domain above the missing domain is
tures of Sprite are support for user LWPs the longest matching prefix. The selected
and a process migration facility, which is server cannot complete the pathname trav-
transparent both to users and the migrated ersal since it descends outside its domain.
process. The solution to this problem is to place a
marker to indicate domain boundaries (a
10.2 Looking Up Files with Prefix Tables mount point). The marker is a special kind
of file called a remote link. Similar to a
Sprite presents its user with a single file symbolic link, it.s content is a file name-
system hierarchy. The hierarchy is com- its own name in this case. When a server
posed of several subtrees called domains encounters a remote link, it returns the file
(the Sprite term for component unit), with name to the client.
each server providing storage for one or So far, the key difference from mappings
more domains. Each machine maintains a based on the UNIX mount mechanism is
server map called a prefix table, whose func- the initial step of matching the file name
tion is to map domains to servers [Welch against the prefix table instead of looking
and Ousterhout 19861. The mapping is built it up component by component. Systems
and updated dynamically by a broadcast (such as NFS and conventional UNIX)
protocol. We first describe how the tables that use a name lookup cache get a similar

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems 359
effect of avoiding the component-by- of the uses of this provision can be for
component lookup once the cache holds the the directory /usr/tmp, which holds tem-
appropriate information. Prefix tables are, porary files generated by many UNIX pro-
however, a unique mechanism mainly be- grams. Every workstation needs access to
cause of the way they evolve and change. /usr/tmp. But workstations with local
When a remote link is encountered by the disks would probably prefer to use their
server, it indicates that the client lacks an own disk for the temporary space. They can
entry for a domain-the domain whose re- set up their /usr/tmp domains for private
mote link was encountered. To obtain the use, with a network file server providing a
missing prefix information, a client broad- public version of the domain for diskless
casts a file name. A server storing that file clients. All broadcast queries for /usr/tmp
responds with the prefix table entry for this would be handled by the public server.
file, including the string to use as a prefix, A primitive form of read-only replication
the server’s address, and the descriptor cor- can also be provided. It can be arranged so
responding to the domain’s root. The client that servers storing a replicated domain
can then fill in the details in its prefix table. give different clients different prefix entries
Initially, each client starts with an empty (standing for different replicas) for the
prefix table. The broadcast protocol is in- same domain. As a result, the service load
voked to find the entry for the root domain. is divided among the servers as each rep-
More entries are added as needed; a domain lica serves a different set of clients. The
that has never been accessed will not same technique can be used for sharing
appear in the table. binary files by different hardware types
The server locations kept in the prefix of machines.
table are hints that are corrected when Since the prefix tables bypass part of the
found to be wrong. Hence, if a client tries director lookup mechanism, the permission
to open a file and gets no response from the checking done during lookup is bypassed
server, it invalidates the prefix table entry too. The effect is that all programs implic-
and issues a broadcast query. If the server itly have search permission along all the
has become available again, it responds to paths denoting prefixes of domains. If ac-
the broadcast and the prefix table entry is cess to a domain is to be restricted, it must
reestablished. This same mechanism also be restricted at the root of the domain or
works if the server reboots at a different below it.
network address or if its domains are mi-
grated to other servers. 10.3 Caching and Consistency
The prefix mechanism ensures that
whenever a server storing a domain is up, An important aspect of the Sprite file sys-
the domain’s files can be accessed regard- tem design is the extent to which it uses
less of the status of servers storing domains using caching techniques. Capitalizing on
that appear in the pathname of the accessed the large main memories and advocating
files. In essence, the built-in broadcast pro- diskless workstations, file caches are stored
tocol enables dynamic reconfiguration and incore. The same caching scheme is used to
a certain degree of robustness. Also, when avoid local disk accesses as well as to speed
a prefix for a domain exists in a client’s up remote accesses. The caches are orga-
table, a direct client-server connection is nized on a block basis. Blocks are currently
established as soon as the client attempts 4Kb. Each block in the cache is virtually
to open a file in that domain (in contrast addressed by the file designator and a block
to pathname traversal schemes). location within the file. Using virtual ad-
A machine with a local disk wishing to dresses instead of physical disk addresses
keep some local files private can accomplish enable clients to create new blocks in the
this by placing an entry for the private cache and locate any block without the file
domain in its prefix table and refusing to i-node being brought from the server. Cur-
respond to broadcast queries about it. One rently, Sprite does not use read-ahead to

ACM Computing Surveys, Vol. 22, No. 4, December 1990


360 . E. Levy and A. Silberschatz
speed up sequential read (in contrast ing their servers. Essentially, the servers
to NFS). are used as centralized control points for
A delayed-write approach is used to han- cache consistency. In order to fulfill this
dle file modification. A dirty block is not function, they must maintain state infor-
written through to the servers cache or the mation about open files.
disk until it is ejected from the cache or
30 s have elapsed since the block was last 10.4 Summary
modified. Hence, a block written on a client
machine will be written to the servers cache Since Sprite is currently under develop-
in at most 30 s and will be written to the ment its design is evolving. Some definite
server’s disk after an additional 30 s. characteristics of the system are, however,
Exact emulation of UNIX semantics is already evident;
one of Sprite’s goals. A hybrid cache vali- Extensive use of caching. Sprite is in-
dation method is used for this end. Files spired by the vision of diskless worksta-
are associated with a version number. The tions with huge main memories and
version number of a file is incremented accordingly relies heavily on caching.
whenever a file is opened in Write mode. The current design is fragile due to the
When a client opens a file, it obtains amount of the state data kept in-core by
the file’s current version number from the the servers. A server crash results in
server and compares this number to the aborting all processes using files on the
version number associated with the cached server. On the other hand, Sprite dem-
blocks for that file. If the version numbers onstrates the big merit of caching in main
are different, the client discards all cached memory-performance.
blocks for the file and reloads its cache Sharing semantics. Sprite sacrifices even
from the server when the blocks are needed. performance in order to emulate UNIX
Because of the delayed-write policy, the semantics. This decision eliminates the
server does not always have the current file possibility and benefits of caching in big
data. Servers handle this situation by keep- chunks.
ing track of the last writer for each file.
Prefix tables. There is nothing out of the
When a client other than the last writer
ordinary in prefix tables. Nevertheless,
opens the file, the server forces the last
for LAN-based file systems, prefix tables
writer to write all its dirty blocks back to
are a most efficient, dynamic, versatile,
the server’s cache. When a server detects
and robust mechanism for file lookup.
(during an Open operation) that a file is
The key advantages are the built-in
open on two or more workstations and at
facility for processing whole prefixes
least one of them is writing the file, it
of pathnames (instead of processing
disables client caching for that file (thereby
component by component) and the sup-
resorting to a remote service mode). All
porting broadcast protocol that allows
subsequent Reads and Writes go through
dynamic changes in the tables.
the server, which serializes the accesses.
Caching is disabled on a file basis, and the
disablement affects only clients with open 11. ANDREW
files. A substantial degradation of perfor-
mance occurs when caching is disabled. A Andrew is a distributed computing environ-
noncachable file becomes cachable again ment that has been under development
when it has been closed on all clients. A file since 1983 at Carnegie-Mellon University.
may be cached simultaneously by several The Andrew file system constitutes the un-
active readers. derlying information-sharing mechanism
This approach depends on the fact that among users of the environment. One of
the server is notified whenever a file is the most formidable requirements of An-
opened or closed. This prohibits perfor- drew is its scale-the system is targeted to
mance optimizations such as name caching span more than 5000 workstations. Since
in which clients open files without contact- 1983, Andrew has gone through design,

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 361

Local name space

Shared name space

Figure 7. Andrew’s name spaces.

prototype implementation, and refinement as an identical and location-transparent


phases. Our description concentrates on a file hierarchy. The local name space is the
recent version reported mainly in Howard root file system of a workstation from
et al. [1988]. It is interesting to examine which the shared name space descends
how the design evolved from the prototype (Figure 7). Workstations are required to
to the current version. An excellent account have local disks where they store their local
of this evolution along with a concise de- name space, whereas servers collectively
scription of the first prototype can be found are responsible for the storage and manage-
in Howard et al. [ 19881. ment of the shared name space. The local
In early 1987 Andrew encompassed about name space is small and distinct from each
400 workstations and 16 servers. Typically, workstation and contains system programs
the workstations were Sun’s and IBM RTs, essential for autonomous operation and
with local disks; the file servers were Sun’s better performance, temporary files, and
or Vax’s, with much larger disks. Section files the workstation owner explicitly
11.1 gives a brief overview of the file system wants, for privacy reasons, to store locally.
and introduces its primary architectural Viewed at a finer granularity, clients and
components. Sections 11.2, 11.3, and 11.4 servers are structured in clusters intercon-
discuss the shared name space structure, nected by a backbone LAN (Figure 8). Each
the strategy for implementing file opera- cluster consists of a collection of worksta-
tions, and various implementation details, tions, a representative of Vice called a clus-
respectively. ter server, and is connected to the backbone
by a router. The decomposition into clus-
il. 1 Overview
ters is primarily to address the problem
of scale. For optimal performance, work-
Andrew distinguishes between client ma- stations should use the server on their
chines (sometimes referred to just as work- own cluster most of the time, thereby mak-
stations) and dedicated server machines. ing cross-cluster file references relatively
Servers and clients alike run the UNIX infrequent.
4.2BSD operating system and are intercon- The file system architecture was moti-
nected by an internet of LANs. vated by consideration of scale, too. The
Clients are presented with a partitioned basic heuristic was to off-load work from
space of file names: a local name space and the servers to the clients, in light of the
a shared name space. A collection of dedi- common experience indicating that server’s
cated servers, collectively called Vice, pre- CPU is the system’s bottleneck [Lazowska
sents the shared name space to the clients et al. 19861. Following this heuristic, the

ACM Computing Surveys, Vol. 22, No. 4, December 1990


362 . E. Levy and A. Silberschatz
Backbone Ethernet Security. Special consideration was
given to security. The Vice interface is
considered the boundary of trustworthi-
---..1...--- ness since no user programs are executed
on Vice machines. Authentication and
secure transmission functions based on
the RPC paradigm, are provided as part
+-zq of communication package. After mutual
authentication, a Vice server and a client
communicate via encrypted messages.
Encryption is performed by hardware de-
ws- vices. Information about users and
groups is stored in a protection database
that is replicated at each server.
.--El Protection. Andrew provides access lists
for protecting directories and the regular
W8- UNIX bits for file protection. The access
lists mechanism is based on recursive
groups structure, similar to the registra-
-El tion database of Grapevine [Birrel et al.
19821.
Heterogeneity. Defining a clear interface
to Vice is a key for integration of diverse
workstation hardware and operating sys-
tem. To facilitate heterogeneity, some
Figure 8. Typical cluster in Andrew. files in the local /bin directory are sym-
bolic links pointing to machine-specific
key mechanism selected for remote file op- executable files residing in Vice.
erations is whole file caching. Opening a
file causes it to be cached, in its entirety,
in the local disk. Reads and writes are 11.2 Shared Name Space
directed to the cached copy without involv-
Andrew’s shared name space is constituted
ing the servers. Under certain circum-
of component units called volumes. An-
stances, the cached copy can be retained
drew’s volumes are unusually small com-
for later use.
ponent unit. Typically, they are associated
Entire file caching has many merits,
with the files of a single user. Few volumes
which are described subsequently. This de-
reside within a single disk partition and
sign cannot, however, efficiently accom-
may grow (up to a quota) and shrink in
modate remote access to very large files
size. Volumes are joined together by a
(i.e., above a few megabytes). Thus, a mechanism similar to the mount mecha-
separate design will have to address the
nism. The granularity difference is signifi-
issue of usage of large databases in the
cant, since in UNIX only an entire disk
Andrew environment. Additional issues
partition (containing a file system) can be
in Andrew’s design are briefly noted:
mounted. Volumes are a key administrative
l User mobility. Users are able to access unit and play a vital role in identifying and
any file in the shared name space from locating an individual file.
any workstation. The only noticeable ef- A Vice file or directory is identified by a
fect of a user accessing files not from the low-level identifier called fid. Each Andrew
usual workstation would be some initial directory entry maps a pathname compo-
degraded performance due to the caching nent to a fid. A fid has three equal length
of files. components: a volume number, a vnode

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 363
number, and a uniquifier. The vnode num- This key distinction has far-reaching ram-
ber is used as an index into an array con- ifications on performance as well as on
taining the i-node of files in a single semantics of file operations. The operating
volume. The uniquifier allows reuse of system on each workstation intercepts file
vnode numbers, thereby keeping certain system calls and forwards them to a user-
data structures compact. Fid’s are location level process on that workstation. This pro-
independent; therefore, file movements cess, called Venus, caches files from Vice
from server to server do not invalidate when they are opened and stores modified
cached directory contents. copies of files back on the servers from
Location information is kept on a volume which they came when they are closed.
basis in a volume location database repli- Venus may contact Vice only when a file is
cated on each server. A client can identify opened or closed; reading and writing in-
the location of every volume in the system, dividual bytes of a file are performed di-
querying this database. It is the aggregation rectly on the cached copy and bypass
of files into volumes that makes it possible Venus. As a result, writes at some sites are
to keep the location database at a manage- not immediately visible at other sites.
able size. Caching is further exploited for future
To balance the available disk space and opens of the cached file. Venus assumes
use of servers, volumes need to be migrated that cached entries (files or directories) are
among disk partitions and servers. When a valid unless notified otherwise. Therefore,
volume is shipped to its new location, its Venus need not contact Vice on a file open
original server is left with temporary for- in order to validate the cached copy. The
warding information so the location data- mechanism to support this policy is called
base need not be updated synchronously. Callback, and it dramatically reduces the
While the volume is being transferred, the number of cache validation requests re-
original server still may handle updates, ceived by servers. It works as follows: When
which are later shipped to the new server. a client caches a file or a directory, the
At some point the volume is briefly disabled server updates its state information record-
to process the recent modifications, then ing this caching. We say that the client has
the new volume becomes available again at a callback on that file. The server notifies
the new site. The volume movement oper- the client before allowing a modification to
ation is atomic; if either server crashes the the file by another client. In such a case,
operation is aborted. we say that the server removes the callback
Read-only replication at the granularity on the file for the former client. A client
of an entire volume is supported for system- can use a cached file for open purposes only
executable files and seldom-updated files in when the file has a callback. Therefore, if
the upper levels of the Vice name space. a client closed a file after modifying it, all
The volume location database specifies the other clients caching this file lose their
server containing the only read-write copy callbacks. When these clients open the file
of a volume and a list of read-only replica- later, they have to get the new version from
tion sites. the server.
Reading and writing bytes of a file are
done directly by the kernel without Venus
11.3 File Operations and Sharing Semantics
intervention on the cached copy. Venus
The fundamental architectural principle in regains control when the file is closed, and
Andrew is the caching of entire files from if the file has been modified locally, Venus
servers. Accordingly, a client workstation updates the file on the appropriate server.
interacts with Vice servers only during Thus, the only occasions in which Venus
opening and closing of files, and even this contacts Vice servers are on opening files
is not always necessary. No remote inter- that either are not in the cache or have had
action is caused by reading or writing files their callbacks revoked, and on Close-of-
(in contrast to the remote service method). writing sessions.

ACM Computing Surveys, Vol. 22, No. 4, December 1990


364 . E. Levy and A. Silberschatz

Basically, Andrew implements session I 1.4 Implementation


semantics. The only exceptions are file op-
User processes are interfaced to a UNIX
erations other than the primitive Read and
kernel with the usual set of system calls.
Write (such as protection changes at the
The kernel is modified slightly to detect
directory level), which are visible every-
references to Vice files in the relevant
where on the network immediately after
operations and to forward the requests
the operation completes.
to the user-level Venus process at the
In spite of the callback mechanism, a
workstation.
small amount of cached validation traffic
Venus carries out pathname translation
is still present, usually to replace callbacks
component by component as described ear-
lost because of machine or network failures.
lier. It has a mapping cache that associates
When a workstation is rebooted, Venus
volumes of server locations to avoid server
considers all cached files and directories
interrogation for an already known volume
suspect and generates a cache valida-
location. The information in this cache is
tion request for the first use of each such
treated as a hint. If a volume is not present
entry.
in this cache or if the location information
The callback mechanism forces each
turned out to be wrong, Venus contacts a
server to maintain callback information
server, requests the location information,
and each client to maintain validity infor-
and enters this information into the map-
mation. If the amount of callback infor-
ping cache. When a target file is found and
mation maintained by a server is excessive,
cached, a copy is created on the local disk.
the server can break callbacks and reclaim
Venus then returns to the kernel, which
some storage by unilaterally notifying
opens the cached copy and returns its han-
clients and revoking the validity of their
dle to the user process.
cached files. There is a potential for incon-
The UNIX file system is used as a low-
sistency if the callback state maintained by
level storage system for both servers and
Venus gets out of sync with the correspond-
clients. The client cache is a local directory
ing state maintained by the servers.
on the workstation’s disk. Within this di-
Venus also caches contents of directories
rectory are files whose names are place
and symbolic links for pathname transla-
holders for cache entries. Both Venus and
tion. Each component in the pathname is server processes access UNIX files directly
fetched, and a callback is established for it by their i-nodes to avoid the expensive
if it is not already cached or if the client pathname translation routine (namei).
does not have a callback on it. Lookups are Since the internal i-node interface is not
done locally by Venus on the fetched direc- visible to user-level processes (both Venus
tories using fid’s. There is no forwarding of and server processes are user-level pro-
requests from one server to another. At the cesses), an appropriate set of additional
end of a pathname traversal all the inter- system calls was added.
mediate directories and the target file are Venus manages two separate caches-
in the cache with callbacks on them. Future one for status and the other for data. Venus
open calls to this file will involve no net- uses a simple least-recently used algorithm
work communication at all, unless a call- to keep each of them bounded in size. When
back is broken on a component of the a file is flushed from the cache, Venus
pathname. notifies the appropriate server to remove
The only exception to the caching policy the callback for this file. The status cache
are modifications to directories that are is kept in virtual memory to allow rapid
made directly on the server responsible for servicing of system calls that ask for status
that directory for reasons of integrity. information (e.g., the UNIX stat call). The
There are well-defined operations in the data cache is resident on the local disk, but
Vice interface for such purposes. Venus the UNIX I/O buffering mechanism does
reflects the changes in its cached copy to some caching of disk blocks in memory that
avoid refetching the directory. is transparent to Venus.

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 365
A single user-level process on each file this burden from servers. The penalty for
server services all file requests from clients. choosing this strategy and the corre-
This process uses a LWP package with sponding design includes maintaining a
nonpreemptabie scheduling to service lot of state data on the servers to support
many client requests concurrently. The the callback mechanism and specialized
RPC package is integrated with the LWP, sharing semantics.
thereby allowing the file server to be con- Sharing semantics. Andrew’s semantics
currently making or servicing one RPC per are simple and well defined (in contrast
lightweight process. RPC is built on top of to NFS, for instance, where effects of
a low-level datagram abstraction. concurrent accesses are time dependent).
Whole file transfer is implemented as a They are not, however, UNIX semantics.
side effect of RPC call. There is an RPC Basically, Andrew’s semantics ensure
connection per client, but there is no a that a file’s updates are visible across the
priori binding of LWPs to these connec- network only after the file has been
tions. Instead, a pool of LWPs service client closed.
requests on all connections. The use of a Component units and location mapping.
single, user-level, server process allows Ve- Andrew’s component unit-the vol-
nus to maintain in its address space caches ume-is of relatively fine granularity and
of data structures needed for its operation. exhibits some primitive mobility capabil-
On the other hand, a single server process ities. Volume location mapping is imple-
crash has the disastrous effect of paralyzing mented as a complete and replicated
this particular server. mapping at each server.
A more recent version of Andrew differs
slightly from the version described here. Results of a thorough series of perfor-
Instead of whole-file caching, caching in mance experimentation with Andrew are
chunks of 64Kb, is used. Also, the sharing presented in Howard et al [1988]. The re-
semantics were modified slightly. Updates sults confirm the current design predic-
are still immediately invisible. The updat- tions. That is, the desired effects on server
ing client can, however, explicitly request CPU use, network traffic, and overall time
that the changes become visible even to needed to perform remote file operations
remote clients having the file already open. were obtained, in particular under severe
server load. The performance experiments
11.5 Summary
include a benchmark comparison with NFS
in which Andrew demonstrated its superi-
We review the highlights of the Andrew file ority regarding the recently mentioned cri-
system: teria, again especially for severe server load.
l Name space and service model. Andrew
explicitly distinguishes among local and 12. OVERVfEW OF RELATED WORK
shared name spaces, as well as among This paper focused on several concepts and
clients and dedicated servers. Clients systems without exhausting the area of
have a small and distinct local name DFSs. Consequently, many aspects and
space and can access the shared name systems were omitted. In this section we
space managed by the servers. therefore cite references that complement
l Scalability. Andrew is distinguished by this paper.
its scalability. The strategy adopted to Many studies of typical properties of file
address scale is whole file caching (to and characteristics of file accesses have
local disks) in order to reduce servers been done over the years [Ousterhout et al.
load. Servers are not involved in reading 1985; Satyanarayanan 1981; Smith 19811.
and writing operations. The callback These empirical results have vast impact
mechanism was invented to reduce the on the design of a DFS. Material on another
number of validity checks. Performing subject that was not covered in this survey,
pathname traversals by clients off-loads namely security and authentication, can be

ACM ComDutinz Survevs, Vol. 22, No. 4, December 1990


366 l E. Levy and A. Silberschatz
found in Needham and Schroeder [1978] evaluate the feasibility of file migration
and Satyanarayanan [ 19891. as a remote access method. Locating a
A detailed survey of mainly centralized migratory file is based on a primitive
file servers is found in Svobodova [ 19841. mechanism of associating the file’s owner
The emphasis is on support of atomic with a list of possible machines where the
transactions, not on location transparency files can be located. It emphasizes that
and naming. A tutorial on distributed op- file access patterns must exhibit local-
erating systems is presented in Tanenbaum ity to make file migration an attractive
and Van Renesse [1985]. There, a distrib- remote access method.
uted operating system is defined and issues Ibis [Tichy and Ruan 19841. Ibis is the
like communication primitives and protec- successor of Stork. It is a user-level ex-
tion are discussed. These two surveys in- tension of UNIX. Remote file names are
clude an extensive bibliography to a variety prefixed with their host name and can
of distributed systems. appear in system calls as well as in shell
Next, we give a concise overview of a few commands. The replication scheme was
noteworthy DFSs that were not surveyed described in Section 5.3. Low-level, struc-
in this paper. ture, but location-dependent names are
l Roe [Ellis and Floyd 1983; Floyd 19891. used. One of the parts of the structured
Roe presents a file as an abstraction hid- name designates the machine that cur-
ing both replication and location details. rently stores the file. These names render
Files are migrated to achieve balancing file migration a very expensive operation,
of systemwide disk storage allocation and since all directories containing the name
also as a remote access method. Consis- of the migrated file must be updated.
tency of replicated files is obtained by Apollo Domain [Leach et al., 1982, 19851.
a weighted voting algorithm [Gifford The Domain system is a commercial
19791. product featuring a collection of powerful
. Eden [Almes et al, 1983; Black 1985; workstations connected by a high-speed
Jessop et al. 19821. A radically different LAN. An object-oriented approach is
approach is adopted for the experimental taken. Files are objects, and as such they
Eden file system from the University of may be of different types. Accordingly, it
Washington. The system is based on the is possible to construct file operations
object-oriented and capability-based ap- that are customized for a particular file
proaches [Levy 19841. A file is a dynamic type. All the objects in the system are
object that can be viewed as an instance named by a networkwide, unique, low-
of an abstract data type. It includes pro- level, location-independent name, called
cesses that satisfy requests oriented to a UID. Objects are organized in hierar-
the file (i.e., there is no separation of chical, UNIX-like, directories that asso-
passive data files and active server pro- ciate textual names with UIDs. No global
cesses). A kernel-supported storage state information is kept on object loca-
system provides primitives for check- tions. Instead, an interesting location al-
pointing the representation of an object gorithm, based on heuristics (hints) for
to secondary storage, copying it, or mov- guessing the object’s location, is used. For
ing it from machine to machine. Eden instance, one helpful heuristic is to as-
files can be replicated, can be migrated, sume that objects created at the same
are named in a location-independent machine are likely to be located together.
manner, and can support atomic trans- A unique feature of Domain is the way
actions. More material on migratory ob- objects are accessed once located. Objects
jects can be found in the context of the are mapped directly onto clients’ address
Emerald project, conducted in the same spaces and accessed via virtual memory
university [Jul et al. 19871. paging. In terms of remote access meth-
l Stork [Paris and Tichy 19831. Stork is ods, this amounts to caching in the gran-
an experimental file system designed to ularity of pages. Write-through policy

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 367

is used for modification, and client- off-loading work from servers to clients and
initiated approach is used for validation structuring a system as a collection of clus-
of cached data. ters are two sound scalability strategies.
Clusters should be as autonomous as pos-
13. CONCLUSIONS sible and should serve as a modular building
block for an expandable system. A chal-
In this paper we presented the basic con- lenging aspect of scale that might be of
cepts underlying the design of a distributed interest for future designs is the exten-
file system and surveyed five of the most sion of the DFS paradigm over WANs.
prominent systems. A comparison of the Such an extended DFS would be character-
systems is presented in Table 1. A crucial ized by larger latencies and higher failure
observation, based on the assessment of probabilities.
contemporary DFSs, is that the design of a A factor that is certain to be prominent
DFS must depart from approaches devel- in the design of future DFSs is the available
oped for conventional file systems. Basing technology. It is important to follow tech-
a DFS on emulation of a conventional file nological trends and exploit their potential.
system might be a transparency goal, but it Some imminent possibilities are as follows:
certainly should not be an implementation
strategy. Extending mechanisms developed Large main memories. As main memo-
for conventional file systems over a net- ries become larger and less expensive,
work is a strategy that disregards the main-memory caching (as exemplified in
unique characteristics of a DFS. Sprite) becomes more attractive. The re-
Supporting this claim is the observation wards in terms of performance can be
that a loose notion of sharing semantics is exceptional.
more appropriate for a DFS than conven- Optical disks. Optical storage technology
tional UNIX semantics. Restrictive seman- has an impact on file systems in general
tics incur a complex design and intolerable and hence on DFSs in particular, too.
overhead. A provision to facilitate restric- Write-once optical disks are already
tive semantics for database applications available [Fujitani 19841. Their key fea-
may be offered as an option. Consequently, tures are very large density, slow access
UNIX compatibility should be sacrificed time, high reliability, and nonerasable
for the sake of a good DFS design. In this writing. This medium is bound to become
respect, the approach used in Andrew to on-line tertiary storage and replace tape
the semantics of sharing prove superior to devices. Rewritable optical disks are be-
those used in Locus and NFS. coming available and might replace mag-
Another area in which a fresh approach netic disks altogether.
is essential is the server process architec-
ture. There is a wide consensus that some Optical fiber networks. A change in the
form of LWPs is more suitable than tradi- entire approach to the remote access
tional processes for efficiently handling problem can be justified by the existence
high loads of service requests. of these remarkably fast communication
It is difficult to present concrete guide- networks. The concept of local disk is
lines in the context of fault tolerance and faster may be rendered obsolete.
scalability, mainly because there is not Nonvolatile RAMS. Battery-backed
enough experience in these areas. It is clear, memories can survive power outage,
however, that distribution of control and thereby enhancing the reliability of
data as presented in this paper is a key main-memories caches. A large and reli-
concept. User convenience calls for hiding able memory can cause a revolution in
the distributed nature of such a system. As storage techniques. Still, it is questiona-
we pointed out in Section 2, the additional ble whether this technology is sufficient
flexibility gained by mobile files is the next to make main memories as reliable as
step in the spirit of distribution and trans- disks because of the unpredictable con-
parency. Based on the Andrew experience, sequences of an operating system crash

ACM Computing Surveys, Vol. 22, No. 4, December 1990


368 . E. Levy and A. Silberschatz
Table 1. Comparison of Surveyed Systems
UNIX United Locus
Background Interconnecting a set of loosely A highly reliable distributed op-
coupled UNIX systems without erating system providing multi-
modifying the kernel. ple dimensions of transparency
and is UNIX compatible.

Naming scheme Single pseudo-UNIX tree. No- Single UNIX tree, hiding both
ticeable machine boundaries. All replication and location.
pathnames are relative (by the
‘ .’ syntax). Independence of
component systems. Recursive
structuring.

Component unit Entire UNIX hierarchy. A logical filegroup (UNIX file


system).

User mobility Not supported. Supported.

Client-server Each machine can be both. A triple: US, SS, CSS. Every file
group has a CSS that selects SS
and synchronizes accesses. Once
a file is open, direct US-SS
protocol.

Remote-access method Emulation of conventional Once a file is open, accesses are


UNIX across the network. served by caching.

Caching Emulation of UNIX buffering. Block caching similar to UNIX


buffering. A token scheme for
cache consistency. Closing a file
commits it on the server.

Sharing semantics Complete UNIX semantics,


including sharing of file offset.

Pathname traversal The pathname translation re- US reads each directory and per-
quest is forwarded from machine forms lookup itself. Given a tile
to machine. group number, the CSS is found
replicated on all machines in the
logical mount table. The CSS
picks SS.

Reconfiguration, file mobility Impossible to move a file without Because of replication, servers
changing its name. No dynamic can be taken off-line or fail with-
reconfiguration. out disturbance. Directory hier-
archy can be changed by
mounting/unmounting.

Availability Availability of a file means the


CSS and SS are available. Each
component in the file’s path-
name must be available for the
file to be opened. The primary
copy must be available for a
Write.

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems 369

Table l-Continued
NFS Sprite Andrew
A network service so that inde- Designed for an environment Designed as the sharing mecha-
pendent workstations would be consisting of diskless worksta- nism of a large-scale system for a
able to share remote files trans- tions with huge main memories university campus.
parently. interconnected by a LAN.

Each machine has its own view Single UNIX tree; hiding Private name spaces and one
of the global name space. location. UNIX tree for the shared name
space. The shared tree descends
from each local name space.

A directory within an exported A domain (UNIX file system). A volume (typically, all files of a
file system can be remotely single user).
mounted.

Potential for support exists; Supported. Fully supported.


demands certain configuration.

Every machine can be both. Di- Typically, clients are diskless Clustering: Dedicated servers per
rect client-server relationship is and servers are machines with cluster.
enforced. disks.

Remote service mixed with block Block caching in main memory. Whole file caching in local disks.
caching for service. In case of concurrent writes,
switch to Remote Service.

Block caching similar to UNIX Block caching similar to UNIX Read and Write are served
buffering. Client checks validity buffering. Delayed-write policy. directly by the cache without
of cached data on each open. Client checks validity of cached server involvement. Write-on-
Delayed-write policy. data on each open. Server dis- close policy. Server-initiated
ables caching when a file is approach for cache validation
opened in conflicting modes. (callback), hence no need to
check on each open.

Not UNIX semantics. Timing- UNIX semantics. Session semantics.


dependent semantics.

Lookups are done remotely for Prefix tables mechanism. Inside Client caches each directory and
each pathname component, but a domain, lookup is done by performs lookup itself. Given a
all are initiated from the client. server. volume number, the server is
A lookup cache for speedup. found in a volume location data-
base replicated on each server.
Parts of this database are cached
on each machine.

Mount/unmount can be done Broadcast protocol supports Volume migration is supported.


dynamically by superuser for dynamic reassignment of
each machine. domains to servers.

In case of cascading mount, each If a server of a file is available, A client has to have a connection
server along the mount chain has the file is available regardless of to a server, and each pathname
to be available for a file to be the state of other servers (along component must be available.
available. the pathname).

ACM Computing Surveys, Vol. 22, No. 4, December 1990


370 l E. Levy and A. Silberschatz
Table l-Continued
UNIX United Locus
Other fault tolerance issues A file is committed on close. The
primary copy is always up to
date. Other replicas may have
older (but not partially modified)
versions.

Scalability issues Recursive structuring. Replicated mount table on each


site and CSS for a file group are
major problems.

Implementation strategy, archi- UNIX kernel kept intact. Con- Extensive UNIX kernel modifi-
tecture nection layer intercepts remote cation. Kernel is pushed into the
calls. User-level daemons for- network. Some kernel LWP for
ward and service remote opera- remote services. Structured, low-
tions. A spawner process creates level, location-independent file
a server process per user that ac- identifiers are used.
cesses files using file descriptors.

Networking Suitable for arbitrary internet- LAN


work topology.

Communication protocol RPC Specialized low-level protocols


for each operation.

Special features Replication (primary copy).


Atomic update by shadow pag-
ing.

Main advantage Original UNIX kernel. Internet- Performance, because of kernel


working capabilities. implementation. Fault tolerance,
due to replication, atomic up-
date, and other features. UNIX
compatibility.

Main disadvantage Not fully transparent naming. Complicated design and large
kernel. Unscalable features.
Complex recovery due to main-
tained state.

[Ousterhout 19891. Other problems of REFERENCES


relatively slow access time and limited ALMES, G. T., BLACK, A. P., LAZOWSKA, E. D., AND
size still plague this technology. NOE, J. D. 1983. The Eden system: A technical
review. IEEE Trans. Softw. Eng. 11, 1 (Jan.),
43-59.
ACKNOWLEDGMENTS BARAK, A., AND KORNATZKY, Y. 1987. Design Prin-
ciples of Operating Systems for Large Scale Mul-
This work was partially supported by NSF grant IRI ticomputers. IBM Research Division, T. J.
8805215 and by the Texas Advanced Research Pro- Watson Research Center, Yorktown Heights,
gram under grant No. 4355. We would like to thank New York. RC 13220 (#59114).
the anonymous reviewers and the editor-in-chief, BARAK, A., AND LITMAN, A. 1985. MOS: A multi-
Salvatore March, for their constructive comments that computer distributed operating system. Softw.
led us to improve the clarity of this paper. Prac. Exper. 15, 8 (Aug.), 725-737.

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 371
Table l-Continued
NFS Sprite Andrew
Complete stateless service. Idem- No guarantees because of the Not dealt with fully yet. Stateful
potent operations. delayed write policy. Stateful service.
service.

Not intended for very large-scale Because broadcast is relied upon Reducing server load and cluster-
systems. and of server involvement in ing are the main strategy. Repli-
operations there might be a cated location database might be
problem. a problem.

Three layers: UNIX system call New kernel based on multi- Augmenting UNIX kernel with
interface, VFS interface to sepa- threading, intended for multi- user-level processes: Venus at
rate file system implementation processor workstation. each client, and a single server
from operations, and NFS layer. process on each server using
Independent specifications for nonpreemptable LWPs. Struc-
mount and NFS protocols. The tured, low-level, location-
Current implementation is ker- independent file identifiers are
nel based. used.

LAN LAN Cluster structure, with a router


per cluster. All communication is
based on high bandwidth LAN
technology.

RPC and XDR on top of RPC on top of special-purpose RPC on top of datagram proto-
UDP/IP (unreliable datagram). network protocol. col. Whole file transfer as a side
effect.

Stateless service. Regular files used as swapping Authentication and encryption


area. Interaction between file built into the communication
system and virtual memory protocol. Access list mechanism
system. for protection. Limited read-only
replication.

Fault tolerance, because of state- Performance due to main mem- Ability to scale up gracefully.
less protocol. Implementation- ory caching. Clear and simple consistency
independent protocols, ideal for semantics.
heterogeneous environment.

Unclear semantics. Performance Questionable scalability. Not Fault tolerance issues due to
improvements obscure clean much in terms of fault tolerance. maintained state.
design.

BARAK, A., MALKI, D., AND WHEELER, R. 1986. cise in distributed computing. Commun. ACM 25,
AFS, BFS, CFS . or Distributed File Systems 4 (Apr.), 260-274.
for UNIX. In European UNIX Users Group Con-
BIRREL, A. D., AND NELSON, B. J. 1984.
ference Proceedings (Sept. 22-24, Manchester,
Implementing remote procedure calls. ACM
U.K.). EUUG, pp. 461-472.
Trans. Comput Syst. 2, 1 (Feb.), 39-59.
BARAK, A., AND PARADISE, 0. G. 1986. MOS: Scal-
ing up UNIX. In Proceedings of USENIX 1986 BLACK, A. P. 1985. Supporting distributed applica-
Summer Conference. USENIX Association, tions: Experience with Eden. In Proceedings of
Berkeley, California, pp. 414-418. the 10th Symposium on Operating Systems Prin-
BERNSTEIN, P. A., HADZILACOS, V., AND GOODMAN, ciples (Orcas, Island, Wash., Dec. l-4). ACM,
N. 1987. Concurrency Control and Recouery an New York, pp. 181-193.
Database Systems. Addison-Wesley, Reading, BROWNBRIDGE, D. R., MARSHALL, L. F., AND RAN-
Mass. DELL, B. 1982. The Newcastle connection or
BIRREL, A. D., LEVIN, R., NEEDHAM, R. M., AND UNIXes of the world unite! Softw. Prac. Exper.
SCHROEDER, M. D. 1982. Grapevine: An exer- 12, 12 (Dec.), 1147-1162.

ACM Computing Surveys, Vol. 22, No. 4, December 1990


372 . E. Levy and A. Silberschatz
CHERITON, D. R., AND ZWAENEPOEL, W. 1983. The tation: An Advanced Course, G. Goos and
distributed V kernel and its performance for disk- J. Hartmanis, Eds., Springer-Verlag, Berlin,
less workstations. In Proceedings of the 9th Chap. 11, pp. 246-265.
Symposium on Operating Systems Principles LAMPSON, B. W. 1983. Hints for computer system
(Bretton Woods, N.H., Oct.). ACM, New York, designers. In Proceedings of the 9th Symposium
pp. 128-140. on Operating Systems Principles (Bretton Woods,
DAVIDSON, S. B., GARCIA-M• LINA, H., AND SKEEN, N.H., Oct.). ACM, New York, pp. 33-48.
D. 1985. Consistency in partitioned networks. LAZOWSKA, E. D., LEVY, H. M., AND ALMES, G. T.
ACM Comput. Suru. 17, 3 (Sept.), 341-370. 1981. The architecture of the Eden system. In
DION, J. 1980. The Cambridge file server. ACM Proceedings of the 8th Symposium on Operating
SZGOPS, Oper. Syst. Reu. 14, 4 (Oct.), 26-35. Systems Principles (Asilomar, Calif., Dec.). ACM,
DOUGLIS, F., AND OUSTERHOUT, J. K. 1989. New York, pp. 148-159.
Beating the l/O bottleneck. ACM SIGOPS, Oper. LAZOWSKA, E. D., ZAHORJAN, J., CHERITON, D., AND
Syst. Reu. 23, 1 (Jan.), 11-28. ZWAENEPOEL, W. 1986. File access perfor-
ELLIS, C. S., AND FLOYD, R. A. 1983. The ROE file mance of diskless workstations. ACM Trans.
system. In Proceedings of the 3rd Symposium on Comput. Syst. 4, 3 (Aug.), 238-268.
Reliability in Distributed Software and Database LEACH, P. J., STUMP, B. L., HAMILTON, J. A., AND
Systems (Clearwater Beach, Florida, October LEVINE, P. H. 1982. UlDs as internal names in
17-19). IEEE, New York. a distributed file system. In Proceedings of the 1st
FLOYD, R. 1989. Transparency in distributed file Symposium on Principles of Distributed Comput-
systems. Tech. Rep. 272, Department of Com- ing (Ottawa, Ontario, Canada, Aug. 18-20). ACM,
puter Science, University of Rochester. New York, pp. 34-41.
FUJITANI, L. 1984. Laser optical disks: The coming LEACH, P. J., LEVINE, P. H., HAMILTON, J. A.,
revolution in on-line storage. Commun. ACM 27, STUMP, B. L. 1985. The file system of an inte-
6 (June). grated local network. In Proceedings of the ACM
GIFFORD, D. 1979. Weighted voting for replicated Computer Science Conference (New Orleans,
data. In Proceedings of the 7th Symposium on Mar.). ACM, New York.
Operating Systems Principles (Pacific Grove, LEVY, H. M. 1984. Capability Based Computer Sys-
Calif. Dec. 10-12). ACM, New York, pp. 150-159. tems. Digital Press, Bedford, Mass.
HILL M., EGGERS, S., LARUS, J., TAYLOR, G., MCKUSICK, M. K., JOY, W. N., LEFFER, S. J., FABRY,
ADAMS, G., BOSE, B. K., GIBSON, G., R. S. 1984. A fast file system for UNIX. ACM
HANSEN, P., KELLER, J., KONG, S., LEE, C., Trans. Comput. Syst. 2, 3 (Aug.), 181-197.
LEE, D., PENDLETON, J., RITCHIE, S., WOOD, D.,
ZORN. B.. HILFINGER. P.. HODGES. D.. KATZ. R.. MITCHELL, J. G. 1982. File servers for local area
OUST~R~OUT, J., AND ‘PATTER&N,’ D. 1986: networks. Lecture Notes, Course on Local Area
Design decisions in SPUR. IEEE Comput. 19, 11 Networks, University of Kent, Canterbury,
(Nov.), 8-22. England, pp. 83-114.
HOARE, C. A. R. 1974. Monitors: An operating sys- MORRIS, J. H., SATYANARAYANAN, M., CONNEK,
tem structuring concept. Commun. ACM 17, 10 M. H., HOWARD, J. H., ROSENTHAL, D. S. H.,
(Oct.), 549-557. AND SMITH, F. D. 1986. Andrew: A distributed
personal computing environment. Commun.
HOWARD, J. H., KAZAR, M. L., MENEES, S. G., ACM 29, 3 (Mar.), 184-201.
NICHOLS, D. A., SATYANARAYANAN, M., AND
SIDEBOTHAM, R. N. 1988. Scale and perfor- NEEDHAM, R. M. HERBERT, A. J. 1982. The Cam-
mance in a distributed file system. ACM Trans. bridge Distributed Computing System, Addison
Comput. Syst. 6, 1 (Feb.), 55-81. Wesley, Reading, Mass.
JESSOP, W. H., JACOBSON, D. M., NOE, J. D., NEEDHAM, R. M., AND SCHROEDER, M. D. 1978.
BAER, J. L., AND Pu, C. 1982. The Eden trans- Using encryption for authentication in large net-
action based file system. In Proceedings of the works of computers. Commun. ACM 21, 12 (Dec.
2nd Symposium onkeliability in Distributed Soft- 1978).
ware and Databases Systems (July). IEEE, New NELSON, M., WELCH, B., AND OUSTERHOUT, J. K.
York, pp. 163-169. - 1988. Caching in the Sprite network file system.
JUL, E., LEVY, H. M., HUCHINSON, N., AND BLACK, ACM Trans. Comput. Syst. 6, 1 (Feb.).
A. 1987. Fine grain mobility in the Emerald OUSTERHOUT, J. K., DA COSTA, H., HARISSON, D.,
system (extended abstract). In Proceedings of the KUNZE, J. A., KUPFLER, M., AND THOMPSON,
11 th Symposium on Operating Systems Principles, J. G. 1985. A trace-driven analysis of the UNIX
(Austin, Texas, November). ACM, New York. 4.2 BSD file system. In Proceedings of the Z.Oth
KEPECS, J. 1985. Light weight processes for UNIX Symposium on Operating Systems Principles
implementation and applications. In Proceedings (Orcas Island, Wash., Dec. l-4). ACM, New York,
of Usenix 1985 Summer Conference. pp. 15-24.
LAMPSON, B. W. 1981. Atomic transactions. In Dis- OUSTERHOUT, J. K., CHERENSON, A. R., DOUGLIS,
tributed Systems-Architecture and Implemen- F., NELSON, M. N., AND WELCH, B. B. 1988.

ACM Computing Surveys, Vol. 22, No. 4, December 1990


Distributed File Systems l 373
The Sprite network operating system. IEEE SUN MICROSYSTEMS, INC. 1988. Network Program-
Comput. 21, 2 (Feb.), 23-36. ming, Sun Microsystems, Part Number: 800-
PARIS, J. F., AND TICHY, W. F. 1983. Stork: An 1779-10, Revision A, of 9 May 1988.
experimental migrating file system for computer SVOBODOVA, L. 1984. File servers for network based
networks. In Proceedings IEEE INFCOM. IEEE, distributed systems. ACM Comput. Sure. 26, 4
New York, pp. 168-175. (Dec.), 353-398.
POPEK, G., AND WALKER, B. Eds. 1985. The LOCUS TANENBAUM, A. S., AND VAN RENESSE, R.
Distributed System Architecture. MIT Press, 1985. Distributed operating systems. ACM
Cambridge Mass. Comput. Suru. 17, 4 (Dec.) 419-470.
POSTEL, J. 1980. User datagram protocol. RFC-768. TERRY, D. B. 1987. Caching hints in distributed
Network Information Center, SRI. systems. IEEE Trans. Softw. Eng. SE-13, 1
QUARTERMAN, J. S., SILBERSCHATZ, A., AND PETER- (Jan.), 48-54.
SON, J. L. 1985. 4.2 and 4.3 BSD as examples TEVANIAN, A., RASHID, R., GOLUB, D., BLACK, D.,
of the UNIX system. ACM Comput. Suru. 17, 4 COOPER, E., AND YOUNG, M. 1987. Mach
(Dec.). threads and the UNIX kernel: The battle for
RANDELL, B. 1983. Recursively structured distrib- control. In Proceedings of USENIX 1987 Summer
uted computing systems. In Proceedings of the Conference. USENIX Association, Berkeley,
3rd Symposium on Reliability in Distributed Soft- California.
ware and Database Systems (Clearwater Beach, TICHY, W. F., AND RUAN, Z. 1984. Towards a dis-
Fla., Oct. 17-19). IEEE, New York, pp. 3-11. tributed file system. In Proceedings of Usenix
RITCHIE, D. M., AND THOMPSON, K. 1974. The 1984 Summer Conference (Salt Lake City, Utah),
UNIX time sharing system. Commun. ACM 19, 7 pp. 87-97.
(Jul.), 365-375. WALKER, B., POPEK, J., ENGLISH, R., KLINE, C.,
SANDBERG, R., GOLDBERG, D., KLEIMAN, S., WALSH, THIEL, G. 1983. The LOCUS distributed oper-
D., AND LYONE, B. 1985. Design and implemen- ating system. ACM SIGOPS, Oper. Syst. Rev. 17,
tation of the Sun network file system. In Pro- 5 (Oct.), 49-70.
ceedings of Usenix 1985 Summer Conference WEINSTEIN, M. J., PAGE, T. W. JR., B. K. LIVEZEY,
(Jun.), pp. 119-130. AND G. J. POPEK. 1985. Transactions and syn-
SATYANARAYANAN, M. 1981. A Study of file sizes chronization in a distributed operating system.
and functional lifetimes. In Proceedings of the In Proceedings of the 10th Symposium on Opera-
8th Symposium on Operating Systems Principles ting Systems Principles (Orcas Island, Wash.,
(Asilomar, Calif., Dec.). ACM, New York. Dec. l-4). ACM, New York.
SATYANARAYANAN, M. 1989. Integrating security in WELCH, B. 1986. The Sprite remote procedure call
a large distributed system. ACM Trans. Comput. system. Tech. Rep. UCB/CSD 86/302, Computer
Syst. 7, 3 (Aug.), 247-280. Science Division (EECS). Universitv of Califor-
SATYANARAYANAN, M., HOWARD, J. H., NICHOLS, D. nia, Berkeley.
A., SIDEBOTHAM, R. N., SPECTOR, A. Z., AND WELCH, B., AND OUSTERHOUT, J. K. 1986. Prefix
WEST, M. J. 1985. ITC distlibuted file system: tables: A Simple mechanism for locating files in
Principles and design. In Proceedings of the 10th a distributed system. In Proceedings of the 6th
Symposium on Operating Systems Principles Conference on Distributed Computing Systems
(Orcas Island, Wash., Dec. l-4). ACM, New York, (Cambridge, Mass., May), IEEE, New York,
pp. 35-50. pp. 184-189.
SCHROEDER, M. D., BIRREL, A. D., AND NEEDHAM, ZIMMERMANN, H. 1980. OS1 reference model: The
R. M. 1984. Experience with grapevine: The IS0 model of architecture for open system inter-
growth of a distributed system. ACM Trans. Com- connection. IEEE Trans. Commun. COM-28
put. Syst. 2, 1 (Feb.), 3-23. (Apr.), 425-432.
SCHROEDER, M. D., GIFFORD, D. K., AND NEEDHAM,
R. M. 1985. A caching file system for a program-
mer’s workstation. Proceedings of the 20th Sym-
posium on Operating Systems Principles (Orcas BIBLIOGRAPHY
Island, Wash., Dec. l-4), ACM, New York,
pp. 25-32. BACH, M. J., LUPPI, M. W., MELAMED, A. S., AND
SHELTZER, A. B., AND POPEK, G. J. 1986. Internet YUEH, K. 1987. A remore file cache for
Locus: Extending transparency to an Internet RFS. Summer Usenix Conference Proceedings.
environment. IEEE Trans. Softw. Eng. SE-12, 11 Phoenix, Ariz.
(Nov.). BIRREL, A. D., AND NEEDHAM, R. M. 1980. A uni-
SMITH, A. J. 1981. Analysis of long term file refer- versal file server. IEEE Trans. Softw. Eng. SE-6,
ence patterns for application to file migration 5 (Sept.), 450-453.
algorithms. IEEE Trans. Softw. Eng. 7, 4 (Jul.). BROWN, M. R., KOLLING, K. N., AND TAFT, E. A.
SMITH, A. J. 1982. Cache memories. ACM Comput. 1985. The Alpine file system. ACM Trans. Com-
Suru. 14, 3 (Sept.), 473-530. put. Syst. 3, 4 (Nov.).

ACM Computing Surveys, Vol. 22, No. 4, December 1990


374 l E. Levy and A. Silberschatz
CABRERA, L. F., AND WYLLIE, J. 1988. Quicksilver NEEDHAM, R. M., HERBERT, A. J., AND MITCHELL, J.
distributed file services: An architecture for G. 1983. How to connect stable memory to a
horizontal growth. In Proceedings of the 2nd computer. Operat. Syst. Reu. 17, 1 (Jan.).
IEEE Conference on Computer Workstations PINKERTON, C. B., LAZOWSKA, E. D., NOTKIN, D.,
(Santa Clara, Calif.). IEEE, New York. AND ZAHORJAN, J. A. 1988. Heterogeneous re-
CALLAGHAN, B., AND LYON, T. 1989. The Auto- mote file system. Tech. Rep. 88-08-08, Dept. of
mounter. In Winter USENIX Conference Pro- Computer Science, Univ. of Washington.
ceedings (San Diego). USENIX Association, RIFKIN, FORBES, A. P., HAMILTON, R. L.,
Berkeley, California. SABREO, M., SHAH, S., AND YUEH, K. 1986.
CHARLOCK, H. 1987. RFS in SunOS. In Summer RFS architectural overview. In Summer USE-
USENIX Conference Proceedings (Phoenix, NIX Conference Proceedings (Atlanta, Ga.).
Ariz.). USENIX Association, Berkeley, USENIX Association, Berkeley, California.
California. ROSEN, M. B., WILDE, M. J., FRASER-CAMPBELL, B.
DAVCEV, C., AND BURKHARD, W. A. 1985. 1986. NFS portability. In Summer Usenix Con-
Consistency and recovery control for replicated ference Proceedings (Atlanta, Ga.).
files. In Proceedings of the 10th Symposium on SATYANARAYANAN, M. 1988. On the influence of
Operating Systems Principles (Orcas Island, scale in a distributed system. In Proceedings of
Wash., Dec. l-4). ACM, New York. the 10th International Conference on Software
Engineering, (Singapore, April).
FINLAGSON, R. S., AND CHERITON, D. R. 1987. Log
files: An extended file service exploiting write- SIDEBOTHAM, R. N. 1986. Volumes: The Andrew file
once storage. In Proceedings of the 1 Ith Sympo- system data structuring primitive. In European
sium on Operating Systems Principles (Austin, Unix Users Group Conference Proceedings (Aug.).
Tex., Nov.). ACM, New York. EUUG.
SPECTOR, A. Z., AND KAZAR, M. L. Wide area file
FRIDIRICH, M., AND OLDER, W. 1981. The Felix file
services and the AFS experimental system. UNIX
server. In Proceedings of the 8th Symposium on
Reu. 7, 3 (Mar.).
Operating Systems Principles (Asilomar, Calif.,
Dec.). ACM, New York. STURGIS, H. E., MITCHELL, J. G., AND ISRAEL, J.
1980. Issues in the design and use of a distrib-
GAIT, J. 1988. The optical file-cabinet: A random- uted file system. Operat. Syst. Reu. 14, 3 (Jul.),
access file system for write-once optical disks. 55-69.
IEEE Comput. 21,6 (June).
SVOBODOVA, L. 1986. A reliable object-oriented data
KLEIMAN, S. R. 1986. Vnodes: An architecture for repository for a distributed computer system. In
multiple file system types in Sun UNIX. In Sum- Proceedings of the 8th Symposium on Operating
mer USENIX Conference Proceedings (Atlanta, Systems Principles (Asilomar, Calif., Dec.). ACM,
Ga.). USENIX Association, Berkeley, California. New York.
MITCHELL, J. G., AND DION, J. A. 1982. Comparison WALSH, D., LYON, B., SAGER, G., CHANG, J. M.,
of two network-based file servers. Commun. ACM GOLDBERG, D., KLEIMAN, S., LYON, T.,
25, 4 (Apr.). SANDBERG, R., AND WEISS, P. 1985. Overview
MULLENDER, S. J., AND TANENBALJM, A. S. 1985. A of the Sun network filesystem. In Winter
distributed file service based on optimistic con- USENIX Conference Proceedings (Dallas, Tex.).
currency control. In Proceedings of the 10th Sym- USENIX Association, Berkeley, California.
posium on Operating Systems Principles (Orcas WUPIT, A. 1983. Comparison of Unix Network Sys-
Island, Wash., Dec. l-4). ACM, New York. tems. ACM, New York.

Received March 1989; final revision accepted December 1989.

ACM Computing Surveys, Vol. 22, No. 4, December 1990

You might also like