1959 PROCEEDINGS OF THE WESTERN JOINT COMPUTER CONFERENCE 295
File Searching Using Variable Length Keys
R E N E D E LA B R I A N D A I S f
M' ANY computer applications require the storage
of large amounts of information within the
computer's memory where it will be readily
available for reference and updating. Quite commonly,
it just the key words and the location on the correspond-
ing record. A particular record is then located by search-
ing the index to determine the record's location and
then taking the most rapid approach in arriving at the
more storage space is required than is available in the record. Since only the key words and the locations of the
computer's high-speed working memory. It is, therefore, corresponding records are stored in the index this tech-
a common practice to equip computers with magnetic nique reduces the amount of information which must be
tapes, disks, or drums, or a combination of these to handled during a search. With the smaller amount of in-
provide additional storage. This additional storage is formation involved it is often possible to utilize the com-
always slower in operation than the computer's working puter's high-speed memory for the storage and search-
memory and therefore care must be taken when using it ing of the index. Furthermore, this index can now be
to avoid excessive operating time. ordered or otherwise subjected to speed-up techniques.
This paper discusses techniques for use in locating This index approach often can greatly improve the op-
records stored within a low-speed memory medium erating efficiency of record handling programs. In many
where they are identifiable by a key word or words of instances this improvement is sufficient but there are
variable length on a machine not equipped to accomplish also many cases where a further increase in efficiency
this automatically. The technique is also applicable to the is necessary. In particular, the time required to per-
conversion of variable word length information into form the search when consulting the index may still be
fixed length code words. objectionably large. If this is true then it is necessary to
When records can be stored in a slower memory medi- apply a speed-up technique to the searching operation.
um in such a fashion that their exact location may be Of course these techniques can be applied to any table
determined from the nature of their designation, rea- lookup problem where the nature of the key word or
sonably efficient handling procedures can be established. words does not lead directly to the desired entry.
However, as is often the case, the records cannot be so Peterson 1 has suggested a method of arranging such
easily located and it becomes necessary to examine each an index which greatly reduces the lookup time when
entry in order to locate a particular record. Sequential it can be applied. This method, referred to as the
examination of the key words of each record, until the "bucket method," calls for randomizing the digits of the
desired record is located, is not a satisfactory approach key word to produce a number which indicates that
on machines not having automatic buffered searching point in the available memory where a particular key
facilities, and may not be satisfactory on machines so and its corresponding record location should be stored.
equipped, if, for instance, reels are searched which need If this particular space is not available it is stored in
not be because it is not known in advance that they are the next highest (or lowest) available space. When seek-
not needed. Because the average search time for the ing a particular record, the exact randomization process
desired records is proportional to the number of records is repeated producing the same indicated point and a
stored in this slower memory, the total operating time search is begun from that point in memory and in the
of a program is proportional to the product of the num- previously used direction. When using this method,
ber of records stored and the number of records for which facility must be provided to continue from the other end
search is instigated. This product may approach the when one limit of the available space is reached. During
square of the number of records involved. This relation- the process of placing an entry in this table a record is
ship between operating time and the number of records kept of the number of steps which must be taken before
stored places a definite limitation on the number of finding space to store the entry. This number is then
records which may reasonably be stored by any par- compared with and, if necessary, replaces the previously
ticular program. Fortunately, if the records can be occurring maximum. This maximum can then be used to
stored with the key words in some ordered arrangement, limit the operation when a search is undertaken for an
an educated guess can then be made as to the location item for which there is no entry.
of a particular record, and a better system will result. The object of this procedure is to distribute the rec-
However, records cannot always be arranged in such a ords evenly throughout the available space in spite of
fashion. uneven characteristics which occur because of similar-
When records are large compared to the key word or ities in the structure of the keys. A limitation of this
words, a useful technique is to form an index having in
1
W. W. Peterson, "Addressing for random-access storage," IBM
f U. S. Naval Ordnance Lab., Corona, Calif. J. Res. Dev., vol. 1, pp. 130-146; April, 1957.
296 1959 PROCEEDINGS OF THE WESTERN JOINT COMPUTER CONFERENCE
method is that it is necessary to store the key as a part each running of the program. An address, usually the
of the entry in order to identify an item positively lowest numbered available cell, is assigned as the start-
during the searching phase. This increases the storage ing point of the table of first letters. The key words are
space used and, as the memory fills up, the average examined letter by letter and each first letter which
number of spaces which must be examined before an occurs is entered in the table of first letters if such an
entry is located increases. This saturation effect greatly
decreases the operating efficiency. Furthermore, this
1st Letter Table * C~F
method becomes far too involved from the bookkeeping
standpoint when the keys are variable word-length
words which exceed the length of one computer word 2nd Letter Tables A 0 f X d
when working with a fixed word-length machine.
A problem involving variable word-length information
confronted us in writing a FORTRAN type compiler
for the Datatron 205. In this version of FORTRAN we
allow the names of quantities to be of any length and 5th Letter Tables Y T D C~E "D"
they may consist of any number of separate words pro-
vided there are no intervening special characters. For 6th Letter Tables _ _ _ R IT
reasons of simplicity in the internal handling, each of
these external names must be converted to a code word 7th Letter Tables C
of specific length such that it may be stored with other All entries of any one table are covered by a single arc (—),
information within one cell. Further, it is necessary that
Fig. 1—Formation of a set of tables.
the external name be preserved and made easily avail-
able for annotating the finished program. T o accomplish
this, each external word is placed in a file which, for entry does not already exist (Fig. 1). A new table is
future reference, is written on tape as it is formed. The assigned to each of these letters. I t is formed by assign-
position of each entry in this file is then used as the ing it a starting address. Each second letter which occurs
code word for the name. Knowing the position of a is entered in the table assigned to the letter which it
particular external word makes it very simple to recover follows. To each of these second letter entries a new table
for annotation purposes. is assigned as before. Each third letter is placed in the
When it is necessary to have some method of deter- table assigned to the letter pair which it follows. Thus,
mining whether any given word has occurred previously, there will be a table for the letters which follow "CA"
sequentially scanning the previous entries is impractical and a different table for those letters which follow "CB"
because of the limitations mentioned previously. A more if these combinations occur. To each third letter a table
desirable situation would be a scheme that in no way is assigned and the technique is repeated until all letters
depended upon the amount of previously stored in- of the key have been taken care of.
formation in the file. If this could be accomplished, the In our program the words which make up a key are
operating time would then be more nearly proportional compressed to eliminate blank spaces before being
to the number of items for which a search was performed placed in the table as one word. A blank space is then
rather than the product of this number and the total used to signify the end of the word. This blank space is
number of entries in the file. stored in the set of tables as a signal that the entry is
The technique we have developed accomplishes the complete and the code word which was previously de-
goal. The operating time is related to each letter of the termined is stored with this blank instead of an indica-
external word and is, therefore, proportional to the tion of a table assigned for the next letter. If preserva-
number of letters in each word for which a search is per- tion of the blanks is necessary, a special unique mark
formed. The size of the file has little effect on the op- may be used to signal the end of a key.
erating time. Total operating time is proportional to the When attempting to determine the previous occur-
number of external words multiplied by the time re- rence of a particular word the procedure is first to scan
quired for a word of average length. the table of first letters until the desired letter is found.
In our particular application each external name The second letter is then compared with the various
becomes a record which is stored on tape. T h e location entries in the assigned table until agreement is found for
of the first word of each record is the code for that ex- the second letter. The third letter is compared in a
ternal word. The external word itself is the key. In the similar manner with the indicated table and the process
computer's main (drum) memory we form an index of is continued until comparison occurs with a blank. When
the key and its corresponding code. The organization this occurs the desired code may be found in the re-
of this index is the reason for this method's efficiency. mainder of the entry with that blank. In the event that
We call this the "letter tables" method. The index con- the desired letter does not occur in a particular list, it is
sists of a set of tables with the number of tables as well known that this word is not in the index. If this word is
as the number of entries in each table varying with to be added, it is now necessary to store the remaining
De La Briandais: File Searching Using Variable Length Keys 297
letters of the word in the index using the previously de- Fig. 2(b) shows the configuration used for the storage
scribed technique. The first letter to be added will be of a blank which signifies the end of a word. In this
the one which did not occur. Once a letter has been word, the first two digits are blank, the next four digits
added to the tables there are no entries in the newly are the code, and as in the previous case, the last four
formed table so no further searching is necessary, and it digits indicate the address of the next entry. The end
is only necessary to add each letter remaining in the of a particular list is signified by the absence of an
word to the new tables. address indicating the next entry.
As previously mentioned, the code is stored with the
blank which signifies the end of the word. This code is Cell Contents
the next available location for a record in the external 2000 C 2001
language file. As soon as the index is complete the ex- 2001 A 2002 2007
ternal word, which is this next record, is placed in the 2002 N 2003
file. The location indicator is adjusted to indicate the 2003 - CODE 2004
next available space in this file and this determines 2004 D 2005
what the next code word will be. 2005 Y 2006
Since the amount of space required for any of the 2006 - CODE
2007 0 2008
tables in this type of operation depends upon the man-
2008 U 2009
ner in which the letters happen to follow each other, it 2012
2009 L 2010
becomes necessary to assign space to each of the tables 2010 D 2011
as it is needed. This is best done by assigning the next 2011 - CODE
available space to whichever table is being expanded. 2012 N 2013
The programming principles involved in this type of 2013 T 2014 -^
operation were first described by Newell and Shaw. 2 2014 - CODE
However, since in our application it is not necessary to
Fig. 3—Memory distribution of a set of tables.
remove entries from the tables as was the case in their
application, a less involved method than the one they
described can be applied. If a continuous portion of Fig. 3 shows a memory distribution which would re-
memory can be devoted to the storage of the tables it sult from the storage of the four words: can, candy,
can be utilized in a sequential fashion with the next count, and could. In this example there are a total of
available space being the next cell. A simple counter twelve separate tables stored in the fifteen spaces. Al-
can then be used to keep track of this next available though the system may appear more complicated, no
space. This operation results in the storage of the various more bookkeeping is required than in a system of se-
tables in an overlapping fashion and, therefore, it is quentially searching a list of entries where the different
necessary that each entry in a table have an indication entries require different numbers of words in memory
of the location of the next entry in that table. for their storage.
An additional time-saving feature t h a t can be applied
with a slight additional cost of memory space is the
2-diglt establishment of a full set of possible first letters with
4-digit address 4-digit address
Sg letter
code of assigned table of next entry the corresponding second letter tables assigned starting
points. By this we mean that if the first letter is " E " the
(a) first entry in the assigned second letter table will be in
the fifth cell with respect to the beginning of the tables.
4-digit address Those first letters which do not occur are then wasting
Sg 0 0 CODE one memory word each.
of next entry
Since in this scheme there is only one letter stored in
(b) each word, it requires far more space than other schemes
Fig. 2—(a) Format of a letter entry, (b) Format of a where several letters are stored in one word. However,
word-terminating entry. the advantage in scanning speed makes up for this dis-
advantage and it becomes practical to form several
Fig. 2 shows the two types of entries in these tables. tables of this type, storing them in a slower memory
The letter entry shown in Fig. 2(a) has the letter in the medium until needed. When doing this, difficulty can
two digits on the left end of the word, and the four arise if care is not taken to avoid excessive transfers to
digits on the right end are used for the address of the and from this slower memory. Methods of overcoming
next entry in this table. The remaining four digits spec- this problem depend upon the particular application.
ify the address of the first word in the assigned table. In our application each set of tables is no longer needed
after one complete pass of the input, and when overflow
2
of space occurs, during input, those words which cannot
A. Newell and J. C. Shaw, "Programming the logic theory ma-
chine," Proc. WJCC, pp. 230-240; February, 1957. be converted are marked. After completion of the first
298 1959 PROCEEDINGS OF THE WESTERN JOINT COMPUTER CONFERENCE
input pass the set of letter tables is erased and a second We shall now attempt to compare this technique with
input pass begins at that point in the input data where Peterson's "bucket method." The bunching which we
overflow occurred. A new set of tables is formed to con- described as useful to us must be overcome when using
vert the marked words and the process can be repeated the "bucket method." This is usually done by generat-
as often as necessary. ing a number which is influenced by all characters of the
Now let us look more closely at the technique in an word and yet appears to be random with respect to
effort to determine the operating time. For the sake of them. This is extremely difficult when dealing with vari-
the following discussion, we shall assume each character able word length information, especially with long words
to be one of 40 possibilities. This list of possibilities could where only one letter differs or where the letters are the
include the 26 letters of the alphabet, ten decimal digits, same but two have been interchanged. Other difficulties
a blank, and three special characters. The maximum encountered with the "bucket method" include termi-
number of comparisons necessary to determine a word nating the search when enough entries have been ex-
then comes to 40 for each character in the word includ- amined to know that the word is not in the file and then
ing the blank which terminates the word. We find how- finding suitable space to insert the word. The resultant
ever that this maximum is seldom reached. To show bookkeeping can actually consume many times more
this, let us assume a file contains 1000 words. If the first operating time than the actual comparison operation
characters of these words are evenly distributed among requires. Thus, although fewer comparisons may be re-
the 40 possibilities there will be approximately 25 words quired when using the "bucket method," due to the vari-
starting with each character. Since 25 words can pro- able word length problems, the operating time is
vide only five-eighths of the possible entries in the sec- pushed up into the same range as that of the "letter
ond letter tables we can expect that to determine a tables method" which we have described.
word, the average number of comparisons needed will Another way in which the two methods must be com-
be 20 to determine the first letter, 13 to determine the pared is with regard to the amount of memory required
second letter, and thereafter only one per letter. Thus for the storage of similar amounts of information. In a
a nine-letter word including the blank might require an fixed word length machine up to all but one character
average of approximately 40 comparisons. might be wasted with each word stored when using the
Now we increase the number of words to 10,000 and "bucket method." Also, when a minimum of two adja-
we find that we have an average of 250 words per char- cent cells are used, one for the word and one for the code,
acter in the first letter table, 62\ per character in the an occasional word will be lost due to storage of a word
second letter tables, and approximately one and one- requiring an odd number of cells in such a position as to
half in the third letter tables. The average number of leave only one unused cell between itself and an adja-
comparisons for a nine-letter word now comes to 20 for cent entry. The amount of memory space required for
each of the first three letters and one for each of the re- the storage of a particular amount of information in the
maining letters. This new total of 66 is 1.65 times greater "letter tables method" cannot be specifically determined
than the previous average. because it is quite dependent upon the number of repeti-
When the words stored in such a fashion are taken tions of letter sequences which occur. The number of
from some formal language such as English, the num- cells required will always be greater than the number of
ber of words beginning with certain characters tends to words and may in some instances approach the total
increase and, therefore, the possibility of having all of number of characters stored. This means that the "letter
the various characters occur in the table of first letters tables method" will probably require from two to six
is decreased. This decreases the average number of times as much memory space as the "bucket method"
comparisons needed to determine the first letter. Fur- for a similar amount of information.
thermore, the number of letters which normally might The amount of code required by the "bucket method"
follow a particular letter is limited so that the average may run three to five times as much as for the "letter
number of comparisons is reduced for subsequent letters tables method" depending on the application. Also, this
also. latter method could quite probably be written as a sub-
For example we normally expect "U" to follow "Q" routine or by a generator in a compiling routine with
and one of the vowels or the letters "H," "L," "R," or much greater ease than could the "bucket method."
"Y" to follow the letter "C." Thus we find we are able The final choice of method depends on details of the
to adjust favorably the averages we determined previ- specific problem and also on operating characteristics
ously due to bunching, a phenomenon which usually of the machine on which it is to be run. It may in some
leads to decreased efficiency in other methods. In the in- cases be necessary to program and run tests before a
stance of the 10,000-word file we might expect the aver- final determination can be made. Both methods are an
ages to be more like 16, 12, 8, 3, and one thereafter order of magnitude faster than the simple sequential
which would be 44 for the nine-letter word, an improve- search and we have found them both to be of value in
ment of one-third. different parts of the FORTRAN for Datatron Project.