III BSc (Semester – VI)            Distributed Systems                  Unit V
UNIT V
    File Models, File Accessing Models, File Sharing Semantics, File Caching
    Schemes, File Replication, Atomic Transactions, Cryptography,
    Authentication, Access control and Digital Signatures.
                            *****************
    Introduction
      Distributed file systems support the sharing of information in the
form of files and hardware resources. The goal of a distributed file
service is to enable programs to store and access remote files exactly
as they do local ones. File systems were originally developed for
centralized computer systems and desktop computers, where the file
system was an operating system facility providing a convenient
programming interface to disk storage.
    Characteristics of File Systems
      Ø File systems are responsible for the organization, storage,
         retrieval, naming, sharing and protection of files.
      Ø Files contain both data and attributes.
  Ø Files are managed by using a data structure called an attribute
     record, which consists of information about the attributes of a file.
  Ø A typical attribute record structure is illustrated in the figure below.
    File Models:
    Unstructured and Structured Files
          In the unstructured model, a file is an unstructured sequence of
    bytes. The interpretation of the meaning and structure of the data
    stored in the files is up to the application (e.g. UNIX and MS-DOS). Most
    modern operating systems use the unstructured file model.
          In structured files (rarely used now) a file appears to the file
    server as an ordered sequence of records. Records of different files of
    the same file system can be of different sizes.
    Mutable and Immutable Files
Based on the modifiability criteria, files are of two types: mutable and
immutable. Most existing operating systems use the mutable file model.
    An update performed on a file overwrites its old contents to produce the
    new contents.
          In the immutable model, rather than updating the same file, a
    new version of the file is created each time a change is made to the file
    contents and the old version is retained unchanged. The problems in
    this model are increased use of disk space and increased disk activity.
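The versioning behaviour of the immutable model can be sketched in a few lines. This is a hypothetical illustration, not a real file-system API; the class and method names are invented for the example.

```python
# Toy sketch of the immutable file model: every "update" creates a new
# version instead of overwriting the old contents. All names here are
# illustrative, not part of any real file-system interface.

class ImmutableFile:
    def __init__(self, name, data=b""):
        self.name = name
        self.versions = [data]          # version 0 is the initial contents

    def update(self, new_data):
        """Create a new version; older versions remain unchanged."""
        self.versions.append(new_data)
        return len(self.versions) - 1   # index of the new version

    def read(self, version=-1):
        """Read the latest version by default, or any retained one."""
        return self.versions[version]

f = ImmutableFile("report.txt", b"draft 1")
f.update(b"draft 2")
f.read()      # latest contents
f.read(0)     # the old version is still retained
```

Note how the `versions` list grows with every update: this is exactly the increased disk usage the text describes as the model's drawback.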
         1   Prepared by P.Y.Kumar © www.anuupdates.org
    1. Explain File Accessing Models in Distributed System.
      The way a client's file access requests are serviced depends on the
method used for accessing remote files and on the unit of data access.
    1. Accessing Remote Files:
          A distributed file system may use one of the following models to
    service a client file access request when the accessed file is remote:
       Ø Remote service model
      Processing of a client request is performed at the server node: the
client's file access request is delivered across the network as a
message to the server, the server machine performs the access, and the
result is sent back to the client. The design must minimize the number
of messages exchanged and the per-message overhead.
       Ø   Data-Caching Model
      This model attempts to reduce the network traffic of the previous
model by caching the data obtained from the server node, taking
advantage of the locality found in file access patterns. A replacement
policy such as LRU (least recently used) is used to keep the cache size bounded.
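An LRU replacement policy like the one mentioned above can be sketched with Python's standard `collections.OrderedDict`. The block keys and capacity are illustrative assumptions; a real DFS client would also handle cache validation.

```python
from collections import OrderedDict

# Minimal LRU replacement policy for a client-side cache, as used in
# the data-caching model. Keys and capacity are illustrative.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()       # key -> cached data

    def get(self, key):
        if key not in self.blocks:
            return None                   # cache miss: fetch from server
        self.blocks.move_to_end(key)      # mark as most recently used
        return self.blocks[key]

    def put(self, key, data):
        if key in self.blocks:
            self.blocks.move_to_end(key)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put(("f", 0), b"block0")
cache.put(("f", 1), b"block1")
cache.get(("f", 0))                        # touch block 0
cache.put(("f", 2), b"block2")             # evicts block 1, the LRU entry
```

After the last `put`, the cache holds blocks 0 and 2; block 1, least recently used, has been evicted to keep the cache size bounded.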
    2. Unit of Data Transfer:
       In file systems that use the data-caching model, an important
design issue is the unit of data transfer, i.e., the fraction of a file that
is transferred to and from clients as the result of a single read or
write operation.
    File-Level Transfer Model
          In this model when file data is to be transferred, the entire file is
    moved.
      Advantages: A file needs to be transferred only once in response to
a client request, which is more efficient than transferring it page by
page with the associated network protocol overhead. Server load and
network traffic are reduced, since the server is accessed only once per
file; this gives better scalability. Once the entire file is cached at the
client site, it is immune to server and network failures.
        Disadvantage: requires sufficient storage space on the client
    machine. This approach fails for very large files, especially when the
    client runs on a diskless workstation. If only a small fraction of a file is
    needed, moving the entire file is wasteful.
    Block-Level Transfer Model
   File transfer takes place in file blocks. A file block is a contiguous
portion of a file and is of fixed length (often equal to the virtual
memory page size).
       Advantages: Does not require client nodes to have large storage
    space. It eliminates the need to copy an entire file when only a small
    portion of the data is needed.
   Disadvantages: When an entire file is to be accessed, multiple
server requests are needed, resulting in more network traffic and more
network protocol overhead. NFS uses the block-level transfer model.
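The block-level model can be sketched as follows. The block size and the in-memory stand-in for the remote file are assumptions made for the example.

```python
# Sketch of block-level transfer: the client fetches fixed-size blocks
# on demand instead of the whole file. BLOCK_SIZE and the in-memory
# "remote file" are illustrative assumptions.

BLOCK_SIZE = 4096

def read_block(server_file: bytes, block_no: int) -> bytes:
    """Return one fixed-length block of the file (the last may be short)."""
    start = block_no * BLOCK_SIZE
    return server_file[start:start + BLOCK_SIZE]

remote = bytes(10000)            # a 10 000-byte "remote" file
first = read_block(remote, 0)    # a full 4096-byte block
last = read_block(remote, 2)     # the short final block (1808 bytes)
```

Reading the whole file this way takes three requests, illustrating the disadvantage above: whole-file access under block-level transfer multiplies the number of server round trips.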
    Byte-Level Transfer Model:
      The unit of transfer is a byte. This model provides maximum
flexibility because it allows storage and retrieval of an arbitrary
portion of a file, specified by an offset within the file and a length. The
drawback is that cache management is harder due to the variable-length
data of different access requests.
    Record-Level Transfer Model:
          This model is used with structured files and the unit of transfer is
    the record.
    2. Explain File-Sharing Semantics in Distributed Systems.
          Multiple users may access a shared file simultaneously. An
    important design issue for any file system is to define when
    modifications of file data made by a user are observable by other users.
    UNIX Semantics:
          This enforces an absolute time ordering on all operations and
    ensures that every read operation on a file sees the effects of all
    previous write operations performed on that file.
       UNIX semantics is implemented in file systems for single-CPU
systems because it is the most desirable semantics and because it is
easy to serialize all read/write requests there. Implementing UNIX
    semantics in a distributed file system is not easy. One may think that
    this can be achieved in a distributed system by disallowing files to be
    cached at client nodes and allowing a shared file to be managed by only
    one file server that processes all read and write requests for the file
    strictly in the order in which it receives them. However, even with this
    approach, there is a possibility that, due to network delays, client
    requests from different nodes may arrive and get processed at the
    server node in an order different from the actual order in which the
    requests were made.
           Also, having all file access requests processed by a single server
    and disallowing caching on client nodes is not desirable in practice due
    to poor performance, poor scalability, and poor reliability of the
    distributed file system.
    3. Explain File Caching Schemes in Distributed Systems.
     Every distributed file system uses some form of caching. The
reasons are:
Ø      Better performance, since repeated accesses to the same
information can be served from the cache, avoiding additional network
accesses and disk transfers. This works because of the locality in file
access patterns.
Ø      It contributes to the scalability and reliability of the distributed file
system, since data can be cached on client nodes and accessed locally.
           Key decisions to be made in file-caching scheme for distributed
    systems:
          ü Cache location
          ü Modification Propagation
          ü Cache Validation
    Cache Location:
          This refers to the place where the cached data is stored. Assuming
    that the original location of a file is on its server disk, there are three
    possible cache locations in a distributed file system:
    Ø Server Main Memory
      In this case a cache hit costs one network access.
       It does not contribute to the scalability and reliability of the
distributed file system, since every cache hit requires accessing the
server.
     Advantages:
       ü Easy to implement
       ü Totally transparent to clients
       ü Easy to keep the original file and the cached data consistent.
    Ø     Client Disk
          In this case a cache hit costs one disk access. This is somewhat
    slower than having the cache in server main memory. Having the cache
    in server main memory is also simpler.
    Advantages:
  ü Provides reliability against crashes, since modifications to cached
    data survive a client crash, unlike a cache kept in main memory.
      ü Large storage capacity.
      ü Contributes to scalability and reliability because on a cache hit the
        access request can be serviced locally without the need to contact
        the server.
    Client Main Memory:
          Eliminates both network access cost and disk access cost. This
    technique is not preferred to a client’s disk cache when large cache size
    and increased reliability of cached data are desired.
    Advantages:
      ü Maximum performance gain.
      ü Permits workstations to be diskless.
      ü Contributes to reliability and scalability.
    Modification Propagation:
       When the cache is located on client nodes, a file's data may
simultaneously be cached on multiple nodes. Caches can become
inconsistent when the file data is changed by one of the clients and the
corresponding data cached at other nodes is not changed or discarded.
    There are two design issues involved:
          ü When to propagate modifications made to a cached data to the
            corresponding file server.
          ü How to verify the validity of cached data.
      The modification propagation scheme used has a critical effect on
the system's performance and reliability. Techniques used include:
    Write-Through Scheme
          When a cache entry is modified, the new value is immediately
    sent to the server for updating the master copy of the file.
    Advantage:
            High degree of reliability and suitability for UNIX-like semantics.
      This is due to the fact that the risk of updated data getting lost in the
      event of a client crash is very low since every modification is
      immediately propagated to the server having the master copy.
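The write-through scheme can be sketched with an in-memory stand-in for the server's master copy. The dict-based "server" and class names are illustrative assumptions.

```python
# Sketch of the write-through scheme: every modification to the client
# cache is immediately propagated to the server's master copy. The
# dict-based "server" is a stand-in, not a real file server.

server = {}          # master copies, keyed by file name

class WriteThroughCache:
    def __init__(self):
        self.cache = {}

    def write(self, name, data):
        self.cache[name] = data   # update the cached copy ...
        server[name] = data       # ... and the master copy, synchronously

    def read(self, name):
        return self.cache.get(name, server.get(name))

c = WriteThroughCache()
c.write("a.txt", b"v1")
# Even if the client crashed here, the server already holds b"v1".
```

The synchronous update in `write` is what gives the scheme its reliability, and also why every write pays the full network cost.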
    Disadvantage:
          This scheme is suitable only where the ratio of read-to-write
accesses is fairly large, and it does not reduce network traffic for
writes: every write access has to wait until the data is written to the
master copy on the server. Hence the advantages of data caching apply
only to read accesses, because the server is involved in all write
accesses.
    Delayed-Write Scheme:
           To reduce network traffic for writes the delayed-write scheme is
    used. In this case, the new data value is only written to the cache and
    all updated cache entries are sent to the server at a later time.
        There are three commonly used delayed-write approaches:
   Write on ejection from cache:
     Modified data in the cache is sent to the server only when the
cache-replacement policy has decided to eject it from the client's cache.
This can result in good performance, but there is a reliability problem
since some server data may be outdated for a long time.
    Periodic write:
          The cache is scanned periodically and any cached data that has
     been modified since the last scan is sent to the server.
    Write on close:
   Modifications to cached data are sent to the server when the client
closes the file. This does not help much in reducing network traffic for
files that are open for very short periods or are rarely modified.
    Advantages of delayed-write scheme:
  ü Write accesses complete more quickly because the new value is
  written only to the client cache. This results in a performance gain.
  ü Modified data may be deleted before it is time to send it to the
  server (e.g. temporary data). Since such modifications need not be
  propagated to the server at all, this results in a major performance
  gain.
  ü Gathering all file updates and sending them to the server together
  is more efficient than sending each update separately.
    Disadvantage of delayed-write scheme:
     Reliability can be a problem, since modifications not yet sent to the
server from a client's cache will be lost if the client crashes.
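A write-on-close flavour of the delayed-write scheme can be sketched as follows; the names are illustrative, and the dict-based "server" again stands in for a real file server.

```python
# Sketch of the delayed-write (write-back) scheme, write-on-close
# flavour: updates accumulate in the client cache and reach the server
# only when the file is closed.

server = {}

class DelayedWriteCache:
    def __init__(self):
        self.dirty = {}            # modified entries not yet sent

    def write(self, name, data):
        self.dirty[name] = data    # fast: touches only the local cache

    def close(self, name):
        """Propagate the modification when the client closes the file."""
        if name in self.dirty:
            server[name] = self.dirty.pop(name)

c = DelayedWriteCache()
c.write("a.txt", b"v1")
c.write("a.txt", b"v2")            # overwrites v1 before it ever hits the net
assert "a.txt" not in server       # nothing propagated yet
c.close("a.txt")                   # only the final value reaches the server
```

The second `write` absorbing the first before any network traffic occurs illustrates the "modified data may be deleted before it is sent" advantage; a crash before `close` would lose both writes, which is the reliability problem above.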
    Cache Validation schemes:
          The modification propagation policy only specifies when the
    master copy of a file on the server node is updated upon modification of
    a cache entry. It does not tell anything about when the file data residing
    in the cache of other nodes is updated.
          A file data may simultaneously reside in the cache of multiple
    nodes. A client’s cache entry becomes stale as soon as some other client
    modifies the data corresponding to the cache entry in the master copy
    of the file on the server.
           It becomes necessary to verify if the data cached at a client node
    is consistent with the master copy. If not, the cached data must be
    invalidated and the updated version of the data must be fetched again
    from the server.
           There are two approaches to verify the validity of cached data:
    the client-initiated approach and the server-initiated approach.
    Client-initiated approach
          The client contacts the server and checks whether its locally
    cached data is consistent with the master copy. Two approaches may be
    used:
     Checking before every access:
          This defeats the purpose of caching because the server needs to
    be contacted on every access.
    Periodic checking:
         A check is initiated every fixed interval of time.
    Disadvantage of client-initiated approach: If frequency of the
    validity check is high, the cache validation approach generates a large
    amount of network traffic and consumes precious server CPU cycles.
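Periodic client-initiated validation can be sketched using per-file version numbers, which are an illustrative assumption (timestamps are an equally common choice).

```python
# Sketch of client-initiated validation using version numbers: the
# client periodically asks the server for a file's current version and
# discards its copy if stale. The version counters are illustrative.

server_versions = {"a.txt": 3}     # the server's idea of each file's version

class ValidatingCache:
    def __init__(self):
        self.data = {}             # name -> (version, contents)

    def validate(self, name):
        """Drop the cached copy if the server has a newer version."""
        if name in self.data:
            version, _ = self.data[name]
            if version != server_versions[name]:
                del self.data[name]     # stale: forces a re-fetch

c = ValidatingCache()
c.data["a.txt"] = (3, b"old contents")
server_versions["a.txt"] = 4       # another client modified the file
c.validate("a.txt")                # the stale entry is invalidated
```

Every `validate` call is a round trip to the server, which is the traffic and CPU cost the text identifies when the checking frequency is high.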
    Server-Initiated Approach:
            A client informs the file server when opening a file, indicating
    whether a file is being opened for reading, writing, or both. The file
    server keeps a record of which client has which file open and in what
    mode.
            So server monitors file usage modes being used by different
    clients and reacts whenever it detects a potential for inconsistency. E.g.
    if a file is open for reading, other clients may be allowed to open it for
reading, but opening it for writing cannot be allowed. Similarly, a new
client cannot open a file in any mode if the file is already open for writing.
      When a client closes a file, it notifies the server and sends along
any modifications made to the file. The server then updates its record
of which client has which file open in which mode.
            When a new client makes a request to open an already open file
    and if the server finds that the new open mode conflicts with the already
    open mode, the server can deny the request, queue the request, or
    disable caching by asking all clients having the file open to remove that
    file from their caches.
    Note: On the web, the cache is used in read-only mode so cache
    validation is not an issue.
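The server-side bookkeeping described above can be sketched as follows. The "r"/"w" modes and the simple deny-on-conflict policy are illustrative simplifications (the text notes a server may also queue the request or disable caching).

```python
# Sketch of the server-initiated approach: the server records each
# (client, mode) pair per open file and rejects conflicting opens.

class StatefulServer:
    def __init__(self):
        self.open_files = {}       # name -> list of (client, mode)

    def open(self, client, name, mode):
        entries = self.open_files.setdefault(name, [])
        if mode == "w" and entries:            # a writer conflicts with anyone
            return False
        if any(m == "w" for _, m in entries):  # anyone conflicts with a writer
            return False
        entries.append((client, mode))
        return True

    def close(self, client, name):
        self.open_files[name] = [
            (c, m) for c, m in self.open_files.get(name, []) if c != client
        ]

s = StatefulServer()
assert s.open("c1", "a.txt", "r")       # first reader: allowed
assert s.open("c2", "a.txt", "r")       # concurrent readers: allowed
assert not s.open("c3", "a.txt", "w")   # writer while readers exist: denied
```

The `open_files` table is precisely the per-client state that makes this server stateful, which is the disadvantage noted next.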
    Disadvantage: It requires that file servers be stateful. Stateful file
    servers have a distinct disadvantage over stateless file servers in the
    event of a failure.
     4. Explain File Replication in Distributed System.
          High availability is a desirable feature of a good distributed file
    system and file replication is the primary mechanism for improving file
    availability.
      A replicated file is a file that has multiple copies, with each copy
on a separate file server.
    Difference between Replication and Caching:
       ü A replica of a file is associated with a server, whereas a cached
         copy is normally associated with a client.
       ü The existence of a cached copy is primarily dependent on the
         locality in file access patterns, whereas the existence of a replica
         normally depends on availability and performance requirements.
       ü As compared to a cached copy, a replica is more persistent, widely
         known, secure, available, complete, and accurate.
       ü A cached copy is contingent upon a replica. Only by periodic
         revalidation with respect to a replica can a cached copy be useful.
     Advantages of Replication:
     Increased Availability:
    Alternate copies of replicated data can be used when the
primary copy is unavailable.
    Increased Reliability:
          Due to the presence of redundant data files in the system,
    recovery from catastrophic failures (e.g. hard drive crash) becomes
    possible.
    Improved response time:
       Replication enables data to be accessed either locally or from a
node whose access time is lower than that of the primary copy.
    Reduced network traffic:
       If a file's replica is available on a file server residing on the
client's node, the client's access request can be serviced locally,
resulting in reduced network traffic.
    Improved system throughput:
       Several clients' requests to access a file can be serviced in
parallel by different servers, resulting in improved system throughput.
    Better scalability:
       Multiple file servers are available to service client requests due
to file replication. This improves scalability.
    Replication Transparency:
          Replication of files should be transparent to the users so that
    multiple copies of a replicated file appear as a single logical file to its
    users. This calls for the assignment of a single identifier/name to all
    replicas of a file.
          In addition, replication control should be transparent, i.e., the
    number and locations of replicas of a replicated file should be hidden
    from the user. Thus replication control must be handled automatically in
    a user-transparent manner.
    Multi copy Update Problem:
          Maintaining consistency among copies when a replicated file is
    updated is a major design issue of a distributed file system that
    supports file replication.
    Read-only replication:
      In this case the update problem does not arise, but the method is
 too restrictive.
    Read-Any-Write-All Protocol:
          A read operation on a replicated file is performed by reading any
    copy of the file and a write operation by writing to all copies of the file.
    Before updating any copy, all copies need to be locked, then they are
    updated, and finally the locks are released to complete the write.
    Disadvantage: A write operation cannot be performed if any of the
    servers having a copy of the replicated file is down at the time of the
    write operation.
    Available-Copies Protocol:
          A read operation on a replicated file is performed by reading any
    copy of the file and a write operation by writing to all available copies
    of the file. Thus if a file server with a replica is down, its copy is not
    updated. When the server recovers after a failure, it brings itself up to
    date by copying from other servers before accepting any user request.
     Primary-Copy Protocol:
          For each replicated file, one copy is designated as the primary
    copy and all the others are secondary copies. Read operations can be
    performed using any copy, primary or secondary. But write operations
    are performed only on the primary copy. Each server having a
    secondary copy updates its copy either by receiving notification of
    changes from the server having the primary copy or by requesting the
    updated copy from it.
          E.g. for UNIX-like semantics, when the primary-copy server
    receives an update request, it immediately orders all the secondary-
    copy servers to update their copies. Some form of locking is used and
    the write operation completes only when all the copies have been
    updated. In this case, the primary-copy protocol is simply another
    method of implementing the read-any-write-all protocol.
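The primary-copy arrangement with an immediate push to the secondaries (the UNIX-like variant described above) can be sketched as follows; the dict-based servers are illustrative stand-ins.

```python
# Sketch of the primary-copy protocol: writes go only to the primary,
# which then pushes the change to every secondary. The immediate push
# shown here corresponds to the UNIX-like variant described above.

primary = {}
secondaries = [{}, {}]

def write(name, data):
    primary[name] = data
    for s in secondaries:          # primary notifies every secondary
        s[name] = data

def read(name):
    return secondaries[0].get(name)   # reads may use any copy

write("a.txt", b"v1")
```

Because the push happens inside `write` before it returns, every copy is consistent by the time the write completes, which is what makes this variant equivalent to read-any-write-all.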
    5. Explain Atomic Transaction in Distributed System.
     A transaction is a sequence of operations that performs a single
logical function. Examples:
      Ø Withdrawing money from your account
      Ø Making an airline reservation
      Ø Making a credit‐card purchase
      Ø Registering for a course at WPI
Transactions are usually used in the context of databases.
    Definition –Atomic Transaction:
    A transaction that happens completely or not at all.
       Ø No partial results
    Example:
   Ø A cash machine hands you cash and deducts the amount from your
     account
       Ø Airline confirms your reservation and
              ü Reduces number of free seats
              ü Charges your credit card
              ü (Sometimes) increases number of meals loaded on flight
    Atomic Transaction Review:
    Fundamental principles –A C I D
      ü Atomicity–to outside world, transaction happens indivisibly
      ü Consistency–transaction preserves system invariants
      ü Isolated–transactions do not interfere with each other
  ü Durable – once a transaction “commits,” the changes are permanent
Programming in a Transaction System:
    Begin transaction: Mark the start of a transaction.
    End transaction: Mark the end of a transaction and try to “commit”.
    Abort transaction: Terminate the transaction and restore old values.
Read: Read data from a file, table, etc., on behalf of the transaction.
Write: Write data to a file, table, etc., on behalf of the transaction.
      As a matter of practice, separate transactions are handled in
separate threads or processes. The Isolated property means that two
concurrent transactions are serialized, i.e., they run in some
indeterminate order with respect to each other.
    Nested Transactions:
       Ø One or more transactions inside another transaction
       Ø May individually commit, but may need to be undone
    Example:
      Ø Planning a trip involving three flights
      Ø Reservation for each flight “commits” individually
      Ø Must be undone if entire trip cannot commit
Tools for Implementing Atomic Transactions (Single System)
Stable storage:
i.e., write to disk “atomically”.
Log file:
i.e., record actions in a log before “committing” them.
    Log in Stable Storage
Locking protocols:
        Serialize Read and Write operations on the same data by separate
        transactions.
    Begin Transaction
      Ø Place a begin entry in log
    Write
      Ø Write updated data to log
    Abort Transaction
      Ø Place abort entry in log
    End transaction (i.e., commit)
      Ø Place commit entry in log
      Ø Copy logged data to files
      Ø Place done entry in log
    Crash Recovery –Search Log
       Ø If begin entry, look for matching entries
       Ø If done, do nothing (all files have been updated)
       Ø If abort, undo any permanent changes that transaction may have
         made
   Ø If commit but not done, copy updated blocks from log to files, then
     add a done entry
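The logging and recovery steps above can be sketched together. The in-memory log and file dict are illustrative stand-ins for stable storage.

```python
# Sketch of write-ahead logging and crash recovery following the steps
# above: begin/write/commit/done entries go to the log first, and the
# recovery pass redoes transactions that committed but never finished.

log = []
files = {}

def run_transaction(tid, updates):
    log.append(("begin", tid))
    for name, data in updates.items():
        log.append(("write", tid, name, data))
    log.append(("commit", tid))
    for name, data in updates.items():   # copy logged data to files
        files[name] = data
    log.append(("done", tid))

def recover():
    """Redo any transaction that committed but never reached 'done'."""
    committed, done = set(), set()
    for entry in log:
        if entry[0] == "commit":
            committed.add(entry[1])
        elif entry[0] == "done":
            done.add(entry[1])
    for entry in log:
        if entry[0] == "write" and entry[1] in committed - done:
            files[entry[2]] = entry[3]   # replay the logged update

# Simulate a crash after commit but before the files were updated:
log.extend([("begin", 1), ("write", 1, "a.txt", b"v1"), ("commit", 1)])
recover()                                # redo copies the logged data out
```

Because the data reaches the log before the files, the committed update survives the simulated crash; a transaction with no commit entry would simply be ignored by `recover`, giving the all-or-nothing behaviour.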
    6. Explain Cryptography in Distributed Systems.
          In the most abstract sense, we can describe a distributed system
    as a collection of clients and servers communicating by exchange of
    messages.
          Authentication of principals and messages is the major issue in
    secure distributed systems.
    Security Requirements
            Ø Confidentiality
                 ü Protection from disclosure to unauthorized persons
            Ø Integrity
                 ü Maintaining data consistency
            Ø Authentication
                 ü Assurance of identity of person or originator of data
            Ø Availability
                 ü Legitimate users have access when they need it
        Ø Access control
             ü Unauthorized users are kept out
    Modern cryptography:
       Ø Private key cryptography
            ü Problem of communicating a large message in secret is
              reduced to communicating a small key in secret.
       ü Encryption algorithm E turns plain text message M into a cipher
         text C
            – C = E(M)
       ü Decrypt C by using decryption algorithm D which is an inverse
         function of E
            – M = D(C)
   ü Classically, confidentiality was kept by keeping the algorithms
     themselves secret.
   ü This is not practical over distributed systems – too many
     algorithms would be needed.
   ü The solution is to decompose the algorithm into:
        • a function – public
        • a key – private
   ü Encryption algorithm E with secret key Ke; decryption key Kd:
        • M = D_Kd(E_Ke(M))
   ü The function must have the property that different messages
     encrypted with the same key, and the same message encrypted with
     different keys, result in distinct cipher texts.
   ü It is easy to compute the cipher text from the plaintext but
     difficult the other way around.
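The relation M = D_Kd(E_Ke(M)) can be illustrated with a toy repeating-key XOR cipher, where Ke = Kd and decryption is the same function as encryption. This is NOT a secure cipher; it only illustrates the structure.

```python
# Toy symmetric cipher: repeating-key XOR. Applying the same function
# with the same key twice recovers the plaintext, i.e. M = D_K(E_K(M)).
# Purely illustrative; do not use for real confidentiality.

def xor_cipher(message: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(message))

M = b"attack at dawn"
K = b"secret"
C = xor_cipher(M, K)           # E_K(M): looks nothing like the plaintext
assert C != M
assert xor_cipher(C, K) == M   # D_K(C) recovers M with the same key
```

The key K is the small secret that must be communicated, while the function `xor_cipher` itself can be public, matching the decomposition above.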
    Hash Functions:
           Ø Creates a unique “fingerprint” for a message
           Ø Hash has to be protected in some way
    Message Authentication Codes (MACs)
          Ø A secret key is used to authenticate the hash value
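Python's standard `hmac` and `hashlib` modules implement exactly this idea (HMAC); the key and message below are illustrative.

```python
import hmac
import hashlib

# A MAC: a shared secret key authenticates the hash of a message.

key = b"shared-secret"
msg = b"transfer 100 to account 42"

tag = hmac.new(key, msg, hashlib.sha256).hexdigest()

# The receiver, holding the same key, recomputes and compares the tag.
# compare_digest avoids leaking information through timing.
ok = hmac.compare_digest(
    tag, hmac.new(key, msg, hashlib.sha256).hexdigest())
tampered = hmac.compare_digest(
    tag, hmac.new(key, msg + b"0", hashlib.sha256).hexdigest())
```

`ok` is true while `tampered` is false: any change to the message (or use of the wrong key) produces a different tag, so the MAC detects tampering.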
    Public Key Cryptography:
      Ø A significant disadvantage of symmetric ciphers is the key
         management necessary to use them securely.
      Ø Uses matched public/private key pairs
      Ø Anyone can encrypt with the public key, only one person can
         decrypt with the private key
       Ø public-key cryptography        can   be   used   to   implement   digital
         signature schemes
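The matched public/private key idea can be illustrated with textbook RSA on tiny primes: anyone can encrypt with the public key (e, n), but only the holder of the private key (d, n) can decrypt. Real systems use large keys and padding schemes; this toy is purely illustrative.

```python
# Toy textbook RSA on tiny primes, illustrating public-key encryption.
# NOT secure: real RSA needs large primes and proper padding.

p, q = 61, 53
n = p * q                      # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent: modular inverse of e

def encrypt(m):                # public operation: anyone can do this
    return pow(m, e, n)

def decrypt(c):                # private operation: needs d
    return pow(c, d, n)

m = 65
c = encrypt(m)
assert decrypt(c) == m         # only the private key recovers m
```

The asymmetry (publish (e, n), keep d secret) removes the key-distribution problem of symmetric ciphers, and running the operations in the reverse order (private first, public to check) is the basis of the digital signature schemes discussed next. Note that `pow(e, -1, phi)` requires Python 3.8 or later.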
    Digital Signature:
          A digital signature is a mathematical scheme for demonstrating
    the authenticity of digital messages or documents. A valid digital
    signature gives a recipient reason to believe that the message was
    created by a known sender (authentication), that the sender cannot
deny having sent the message (non-repudiation), and that the message
was not altered in transit (integrity).
          Digital signatures are a standard element of most cryptographic
    protocol suites, and are commonly used for software distribution,
    financial transactions, contract management software, and in other
    cases where it is important to detect forgery or tampering.
                          ******************
The following are the Important Questions from UNIT-V:
    1. Explain File Accessing Models in Distributed System.
    2. Explain File-Sharing Semantics in Distributed Systems.
    3. Explain File Caching Schemes in Distributed Systems
    4. Explain File Replication in Distributed System.
    5. Explain Atomic Transaction in Distributed System.
    6. Explain Cryptography in Distributed Systems.
                          ******************