Data and File Structures
Chapter 7
CS2202‐File Organization
     2021‐2022
           Chapter 7
          Indexing
                            1
                            Overview
• An index is a table containing a list of keys associated with
  a reference field pointing to the record.
   • A simple index is simply an array of (key, reference) pairs.
• An index lets you impose order on a file without
  rearranging the file.
   - This makes record addition much less expensive than they are with
     a sorted file.
- Indexing provides multiple access paths.
   - You can have different indexes for the same data file.
• Indexing also give us keyed access to variable-length
  record files.
                                                                    2
  A Simple Index for Entry-Sequenced
                 Files I
• Suppose that we are looking at a collection of
  recordings with the following information about each
  of them:
   – Identification Number
   – Title
   – Composer
   – Artist
   – Label (publisher)
                                                     3
Contents of Sample Recording File
                                4
  A Simple Index for Entry-Sequenced
                Files II
• We choose to organize the file as a series of
  variable-length record with a size field
  preceding each record.
  - The fields within each record are also of variable-
    length but are separated by delimiters.
• We form a primary key by concatenating the
  record company label code and the record’s
  ID number.
  - This should form a unique identifier.
                                                      5
  A Simple Index for Entry-Sequenced
                Files III
• In order to provide rapid key access, we build a
  simple index with a key field associated with a
  reference field which provides the address of the
  first byte of the corresponding data record.
• The index may be sorted while the file does not
  have to be.
  - This means that the data file may be entry
    sequenced: the record occur in the order they are
    entered in the file.
                                                        6
Index of Sample Recording File
                                 7
 A Simple Index for Entry-Sequenced
               Files IV
• A few comments about our Index Organization:
  – The index is easier to use than the data file because
    1. it uses fixed-length records
    2. it is likely to be much smaller than the data file.
 – By requiring fixed-length records in the index file, we
   impose a limit on the size of the primary key field.
 – The index could carry more information than the key
   and reference fields.
    – e.g., we could keep the length of each data file record in the
      index as well.
                                                                       8
 Basic Operations on an Indexed
      Entry-Sequenced File
• Assumption: the index is small enough to be held in memory.
  Later on, we will see what can be done when this is not the
  case.
• Operations in order to maintain an indexed file
   – Create the original empty index and data files
   – Load the index into memory before using it.
   – Rewrite the index file from memory after using it.
   – Add records to the data file and index.
   – Delete records from the data file.
   – Update records in the data file.
                                                           9
  Creating, Loading and Re-writing
• The index is represented as an array of records.
   – The loading into memory can be done sequentially,
     reading a large number of index records (which are short)
     at once.
• What happens if the index changed but its re-writing
  does not take place or takes place incompletely?
   – Use a mechanism for indicating whether or not
     the index is out of date.
   – Have a procedure that reconstructs the index from
     the data file in case it is out of date.
                                                                 10
                  Record Addition
• When we add a record, both the data file and the index
  should be updated.
• In the data file, the record can be added anywhere.
   - However, the byte-offset of the new record should be saved.
• Since the index is sorted, the location of the new record
  does matter.
   - We have to shift all the records that belong after the one we
     are inserting to open up space for the new record.
   - However, this operation is not too costly as it is performed in
     memory.
                                                                  11
              Record Deletion
• Record deletion can be done using the
  methods discussed last lecture.
• In addition, the index record corresponding to
  the data record being deleted must also be
  deleted.
  – Once again, since this deletion takes place in
    memory, the record shifting is not too costly.
                                                     12
                   Record Updating
• Record updating falls into two categories:
   – The update changes the value of the key field.
   – The update does not affect the key field.
• In the first case, both the index and data file may need
  to be reordered.
   - The update is easiest to deal with if it is conceptualized as a
     delete followed by an insert (but the user needs not know
     about this).
• In the second case, the index does not need reordering,
  but the data file may.
   - If the updated record is smaller than the original one, it can be
     re-written at the same location.
   - However, if it is larger, then a new spot has to be found for it.
     Again the delete/insert solution can be used.
                                                                       13
Indexes that are too large to hold
          in memory I
• The indexes that we have considered before could
  fit into main memory.
• If this is not the case, we have the following
  problems:
   – Binary searching requires several seeks rather than
     being performed at memory speed.
   – Index rearrangement (record addition or deletion)
     requires shifting records on secondary storage →
     Extremely time consuming.
• Solutions:
   – Use a hashed organization (later)
   – Use a tree-structured index such as B-trees and B+
     trees (later)
                                                           1
                                                           4
Indexes that are too large to hold
          in memory II
• Nonetheless, simple indexes are still useful:
  – They allow the use of a binary search in a
    variable-length record file.
  – Sorting and maintaining an index is less costly
    than sorting and maintaining the data file,
    because the index is smaller.
  – If there are pinned records in the data file,
    rearrangements of the keys are possible
    without moving the data records.
  – They can provide access by multiple keys.
                                                      1
                                                      5
      Indexing to provide access by
              multiple keys
• So far, our index only allows key access. i.e., we can
  retrieve record DG188807, but we cannot retrieve a
  recording of Beethoven’s Symphony no. 9.
   → Not useful!
• We need to use secondary key fields consisting of album
  titles, composers, and artists.
   –Although it would be possible to relate a secondary key to an
   actual byte offset, this is usually not done.
   –Instead, we relate the secondary key to a primary key which
   then will point to the actual byte offset.
                                                                    1
                                                                    6
Index of Sample Recording File
                                 1
                                 7
Composer Index & Title Index
                               1
                               8
     Record Addition in multiple key
             access settings
• When a secondary index is used, adding a record
  involves updating the data file, the primary index and
  the secondary index.
   – The secondary index update is similar to the primary index
     update.
• Secondary keys are entered in canonical form (ex. all
  capitals).
   – The upper- and lower- case form must be obtained from the
     data file. As well, because of the length restriction on keys,
     secondary keys may sometimes be truncated.
• The secondary index may contain duplicate.
  – The primary index couldn’t                                    7
  Record Deletion in multiple key
             access settings
• Removing a record from the data file means
  removing its corresponding entry in the primary
  index and may mean removing all of the entries
  in the secondary indexes that refer to this
  primary index entry.
• This is too much rearrangement, specially if
  indexes cannot fit into main memory
                                               2
                                               0
   Record Deletion in multiple key
              access settings
• However, it is also possible not to worry about the
  secondary index.
   – Since, as we mentioned before, secondary keys were made
     to point at primary keys. We don’t modify the secondary
     index files.
   – When accessing the file through a secondary key, the primary
     index file will be checked and a deleted record can be
     identified.
   – The deleted record still occupy space in the secondary key
     indexes.
   – If a lot of deletions occur, we can periodically cleanup these
     deleted records.
                                                                 2
                                                                 1
      Record Updating in multiple key
              access settings
• There are three types of updates:
  – The update changes the secondary key.
     • We have to rearrange secondary index to stay in sorted
       order.
  – The update changes the primary key.
     • We have to update and reorder the primary index.
     • We have to update the references to primary key in the
       secondary index.
  – Update confined to other fields.
     • No changes necessary to primary nor secondary index.
                                                                22
        Retrieval using combinations of
                secondary keys
• With secondary keys, we can now search for things like
  all the recordings of “Beethoven’s work” or all the
  recordings titled “Violin Concerto”.
• More importantly, we can use combinations of
  secondary keys.
   – e.g., find all recordings of Beethoven’s Symphony no. 9.
• Without the use of secondary indexes, this request
  requires a very expensive sequential search through
  the entire file.
   – Using secondary indexes, responding to this query is simple
     and quick.
                                                                23
      Improving the secondary index
         structure I: The problem
• Secondary indexes lead to two difficulties:
  – The secondary index file has to be rearranged
    every time a new record is added to the file.
  – If there are duplicate secondary keys, the
    secondary key field is repeated for each entry.
     • Space is wasted.
                                                      24
         Improving the secondary index
             structure II: Solution 1
• Solution 1: Change the secondary index structure so it
  associates an array of reference with each secondary
  key.
• Advantage: helps avoid the need to rearrange the
  secondary index file too often.
• Disadvantages:
   – It may restrict the number of references that can be
     associated with each secondary key.
   – It may cause internal fragmentation, i.e., waste of
     space.
                                                       25
Array of Reference
                     26
         Improving the secondary index
            structure III: Solution 2
• Solution 2: inverted lists.
   – Each secondary key points to a different list of primary key
     references.
   – Each of these lists could grow to be as long as it needs to be
     and no space would be lost to internal fragmentation.
• Advantages:
   – The secondary index file needs to be rearranged only upon
     record addition.
   – The rearranging is faster since it is smaller.
   – Space from deleted primary index records can easily be
     reused, since its records have fixed-length.
• Disadvantage:
   – Locality (in the secondary index) has been lost.
      • More seeking may be necessary.
                                                                  27
Inverted Lists
                 28
             Selective Indexes
• Using secondary keys, you can divide the file
  into parts and provide a selective view.
• For example, you can build a selective index
  that contains only titles to classical recordings
  or recordings released prior to 1970, and since
  1970.
• A possible query could then be: “List all the
  recordings of Beethoven’s Simphony no. 9
  released since 1970.
                                                      29