§1 TRIECODE TRIE DATA STRUCTURE 1
1. Trie Data Structure.
A program to create and manipulate trie data structures.
c 2008, Carlos Oliveira
2. A trie is used to store character strings in encoded form, such that each character is a node of a tree
data structure, and a word is formed by concatenating the characters from the root of the trie to one of its
terminals. One of the advantages of using tries is that insertion and search can be done in time linear on
the size of the string.
The main restriction of this implementation is that it works only for ASCII characters the range [A-Z].
I am doing this mainly to test how well the algorithm perfoms. Real implementations should be able to
handle most character sets, and therefore they need more space to store child nodes at each trie node.
h trie.cpp 2 i ≡
h trie includes 3 i
h trie functions 4 i
3.
h trie includes 3 i ≡
#include "trie.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#define New (t) ( t ∗ ) malloc (sizeof (t))
#define NewV (t, s) ( t ∗ ) malloc (sizeof (t) ∗ (s))
This code is used in section 2.
4. A simple utility function that aborts the program with an error message to stderr.
h trie functions 4 i ≡
void abort (const char ∗msg )
{
fprintf (stderr , msg );
exit (1);
}
See also sections 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, and 19.
This code is used in section 2.
5. A function that checks if a character can be inserted as a child of a trie node. All characters stored are
in uppercase. The function returns an uppercase character, or 0 if the character is not between A and Z.
h trie functions 4 i +≡
char check char (char c)
{
if (’a’ ≤ c ∧ c ≤ ’z’) c −= 32;
if (’A’ ≤ c ∧ c ≤ ’Z’) return c;
return 0;
}
2 TRIE DATA STRUCTURE TRIECODE §6
6. The node initialization function just returns a new trie node, initialized to the character c. Everything
else in the node is NULL.
h trie functions 4 i +≡
static TrieNode ∗trie new node (char c)
{
TrieNode ∗ t = New (TrieNode );
if (¬t) return 0;
t~ data = c;
t~ children = Λ;
t~ n child = 0;
t~ size = 0;
t~ is terminal = 0;
return t;
}
7. A function that creates a new trie data structure.
h trie functions 4 i +≡
Trie ∗ trie new ( )
{
Trie ∗ t = New (Trie );
if (¬t) return 0;
t~ n words = 0;
t~ root = trie new node (0); /∗ alloc root element ∗/
if (¬t~ root ) goto e1 ;
return t;
e1 : free (t);
return 0;
}
8. Allocates a new array of pointers to child trie node, and initialize it to NULLs.
h trie functions 4 i +≡
static TrieNode ∗∗new child array (int size ){ TrieNode ∗ ∗nodes = NewV ( TrieNode ∗ , size ) ;
if (¬nodes ) return 0;
int i;
for (i = 0; i < size ; ++ i) nodes [i] = Λ;
return nodes ; }
9. searches an array of children of the first type (with 8 characters). Returns a trie node if the character
can be found, NULL otherwise.
h trie functions 4 i +≡
static TrieNode ∗in children8 (TrieNode ∗ t, char c)
{
int i, n = t~ n child ;
for (i = 0; i < n; ++ i)
if (t~ children [i]~ data ≡ c) return t~ children [i];
return 0;
}
§10 TRIECODE TRIE DATA STRUCTURE 3
10. adds a new node to a list of children of the first type.
h trie functions 4 i +≡
static TrieNode ∗add children8 (TrieNode ∗ t, char c)
{
assert (t~ n child < 8);
return t~ children [t~ n child ++ ] = trie new node (c);
}
11. implementation for search/add nodes to children list of second type
h trie functions 4 i +≡
static TrieNode ∗search add children26 (TrieNode ∗ t, char c, int insert )
{
assert (t~ size ≡ 26);
int pos = c − ’A’;
if (t~ children [pos ]) return t~ children [pos ];
if (¬insert ) return 0;
TrieNode ∗ node = t~ children [pos ] = trie new node (c);
if (¬node ) abort ("cannot add node in children26");
t~ n child ++ ;
return node ;
}
12. wrappers for function above
h trie functions 4 i +≡
static TrieNode ∗in children26 (TrieNode ∗ t, char c)
{
return search add children26 (t, c, 0);
}
static TrieNode ∗add to children26 (TrieNode ∗ t, char c)
{
return search add children26 (t, c, 1);
}
13. add given nodes (usually from a list of the first type) into a list of the second type
h trie functions 4 i +≡
void add existing nodes26 (TrieNode ∗ t, TrieNode ∗ ∗nodes , int n)
{
assert (t~ size ≡ 26);
int i;
for (i = 0; i < n; ++ i) {
int pos = nodes [i]~ data − ’A’;
assert (¬t~ children [pos ]); /∗ this position must be NULL ∗/
t~ children [pos ] = nodes [i];
}
}
4 TRIE DATA STRUCTURE TRIECODE §14
14. A function that checks if a character c exists as a child of the given trie node. If insert is true, the
value should be inserted as a child if it is not present. modified is an output value, and will be true if there
was any insertion on the child list.
h trie functions 4 i +≡
static TrieNode ∗get or insert child for char (TrieNode ∗ t, char c, int insert , int ∗modified , int
is terminal ){ assert (’A’ ≤ c ∧ c ≤ ’Z’);
TrieNode ∗ node ; h check cases of child list 15 i
15. To reduce memory consumption, each node can have a child list of one of two types: 8-position or
26-position arrays. The first type stores at most 8 characters. They are unordered, so any search needs to
look at all characters currently stored (which on average is 4). The second type of list has an array for all
26 possible characters. The list is in alphabetical order, so it is very easy to find and insert elements
h check cases of child list 15 i ≡
if (t~ size ≡ 0) {
if (¬insert ) return 0; /∗ add memory, return position ∗/
t~ children = new child array (8);
if (¬t~ children ) abort ("cannot add child array");
t~ size = 8;
node = add children8 (t, c);
}
else if (t~ size ≡ 8) { /∗ check if it is there ∗/
node = in children8 (t, c);
if (node ) return node ; /∗ if not, add ∗/
if (¬insert ) return 0;
if (t~ n child < 8) node = add children8 (t, c);
else { /∗ we need more space ∗/
TrieNode ∗ ∗current = t~ children ;
t~ children = new child array (26);
if (¬t~ children ) abort ("cannot add child array");
t~ size = 26;
add existing nodes26 (t, current , 8);
node = add to children26 (t, c);
}
}
else if (t~ size ≡ 26) {
node = in children26 (t, c);
if (node ) return node ;
if (¬insert ) return 0;
node = add to children26 (t, c);
}
else {
abort ("wrong size of child list");
} /∗ we come here only in case the child list was modified ∗/
if (¬node ) abort ("cannot add node");
if (is terminal ) node~ is terminal = 1;
∗modified = 1;
return node ; }
This code is used in section 14.
§16 TRIECODE TRIE DATA STRUCTURE 5
16. we visit the trie recursively, because word lengths are usually small
h trie functions 4 i +≡
static int trie visit rec (TrieNode ∗ t, char ∗word , int len , int insert , int ∗modified )
{
if (len ≡ 0) { /∗ end of recursion ∗/
if (∗modified ) return 2;
return 1;
}
char c = check char (word [0]);
TrieNode ∗ child = get or insert child for char (t, c, insert , modified , len ≡ 1);
if (¬child ) {
if (insert ) abort ("cannot insert word");
else return 0;
}
return trie visit rec (child , word + 1, len − 1, insert , modified );
}
17. searches or inserts a word into a trie. If insert is true, the word will be inserted if it is not already in
the tree. Returns true if the operation is successful.
h trie functions 4 i +≡
int trie visit (Trie ∗ t, char ∗word , int insert )
{
int modified = 0;
int res = trie visit rec (t~ root , word , strlen (word ), insert , &modified );
if (insert ∧ res ≡ 2) t~ n words ++ ;
return res ;
}
6 TRIE DATA STRUCTURE TRIECODE §18
18. Print all words contained in a trie. the algorithm is recursive. If, at some iteration, the is terminal
flag is set, then print the current word. For each of the children of the current node, add the caracter they
represent to the word, and call the function recursively.
h trie functions 4 i +≡
static int trie print rec (TrieNode ∗ t, char ∗word , int pos , int max size )
{
if (pos ≡ max size ) { /∗ check buffer overrun ∗/
word [pos ] = ’\0’;
printf ("word truncated: %s\n", word );
return 0;
}
if (t~ is terminal ) { /∗ print current word ∗/
word [pos ] = ’\0’;
printf ("%s\n", word );
}
int i;
for (i = 0; i < t~ size ; ++ i) { /∗ print all children ∗/
TrieNode ∗ node = t~ children [i];
if (node ) {
word [pos ] = node~ data ;
trie print rec(node , word , pos + 1, max size );
}
}
return 1;
}
#define MAX_SIZE 64
19. Function that calls the recursive implementation.
h trie functions 4 i +≡
int trie print (Trie ∗ t)
{
printf ("printing words\n");
char word [MAX_SIZE + 1];
trie print rec(t~ root , word , 0, MAX_SIZE);
return 1;
}
20.
h main.cpp 20 i ≡
h main headers 21 i
h main functions 23 i
21.
h main headers 21 i ≡
#include "trie.h"
#include <stdio.h>
#include <string.h>
This code is used in section 20.
§22 TRIECODE STRING OPERATIONS 7
22. String Operations.
We need a few functions to process a file and return all words found in that file. We want to strip characters
that are not handled by our trie implementation.
23. returns true if the string w is a word composed only of the characters from A-Z
h main functions 23 i ≡
int is word (char ∗w, int len )
{
int i;
for (i = 0; i < len ; ++ i)
if (¬check char (w[i])) return 0;
return i;
}
See also sections 24, 25, 26, 27, 28, and 29.
This code is used in section 20.
24. true if the character is is a punctuation character
h main functions 23 i +≡
int is punctuation (char c)
{
switch (c) {
case ’.’: case ’,’: case ’;’: case ’:’: return 1;
}
return 0;
}
25. true if the character is is a quote character
h main functions 23 i +≡
int is quote (char c)
{
switch (c) {
case ’\’’: case ’"’: case ’‘’: return 1;
}
return 0;
}
26.
h main functions 23 i +≡
#define MAX_WSIZE 64
8 STRING OPERATIONS TRIECODE §27
27. adjust word to remove quotes and/or punctuation marks
h main functions 23 i +≡
char ∗adjust word (char ∗w)
{ /∗ try to remove at most two quotes ∗/
if (is quote (∗w)) w ++ ;
if (is quote (∗w)) w ++ ; /∗ try to remove at most three quotes/punctuation marks ∗/
int len = strlen (w);
int i;
for (i = 0; i < 3 ∧ i < len ; ++ i) {
int last = len − i − 1;
if (is quote (w[last ]) ∨ is punctuation (w[last ])) w[last ] = 0;
}
return w;
}
28. reads all words in the given file name, and inserts them on the trie.
h main functions 23 i +≡
int proc file (Trie ∗ t, char ∗file )
{
FILE ∗f = fopen (file , "r");
if (¬f ) abort ("error opening file");
char w[MAX_WSIZE + 1]; /∗ write format string ∗/
char format [6];
sprintf (format , "%%%ds", MAX_WSIZE); /∗ insert each word found in file ∗/
for ( ; ; ) {
if (EOF ≡ fscanf (f , format , w)) break;
char ∗word = adjust word (w);
if (is word (word , strlen (word ))) trie visit (t, word , 1);
}
return 1;
}
29.
h main functions 23 i +≡
int main (int argc , char ∗∗argv )
{
Trie ∗ t = trie new ( );
if (¬t) return 1;
if (argc < 2 ∨ argc ≡ 3 ∧ strcmp (argv [2], "−p")) {
printf ("usage:\n %s inputFile [−p]\n", argv [0]);
return 1;
}
proc file (t, argv [1]);
printf ("number of words: %d\n", t~ n words );
if (argc ≡ 3) trie print (t);
return 0;
}
§30 TRIECODE HEADER FILE 9
30. Header File. This is the header file for the trie module.
h trie.h 30 i ≡
#ifndef _TRIE_H_
#define _TRIE_H_
struct TrieNode {
char data ;
struct TrieNode ∗∗children ;
int n child ;
int size ;
int is terminal ;
};
typedef struct TrieNode TrieNode;
struct Trie {
int n words ;
TrieNode ∗root ;
};
typedef struct Trie Trie;
Trie ∗trie new ( );
int trie visit (Trie ∗t, char ∗word , int insert );
int trie print (Trie ∗t);
char check char (char c);
void abort (const char ∗msg );
#endif
10 INDEX TRIECODE §31
31. Index.
_TRIE_H_: 30. pos : 11, 13, 18.
abort : 4, 11, 15, 16, 28, 30. printf : 18, 19, 29.
add children8 : 10, 15. proc file : 28, 29.
add existing nodes26 : 13, 15. res : 17.
add to children26 : 12, 15. root : 7, 17, 19, 30.
adjust word : 27, 28. search add children26 : 11, 12.
argc : 29. size : 6, 8, 11, 13, 15, 18, 30.
argv : 29. sprintf : 28.
assert : 10, 11, 13, 14. stderr : 4.
c: 5, 6, 9, 10, 11, 12, 14, 16, 24, 25, 30. strcmp : 29.
check char : 5, 16, 23, 30. strlen : 17, 27, 28.
child : 16. t: 30.
children : 6, 9, 10, 11, 13, 15, 18, 30. Trie: 7, 17, 19, 28, 29, 30.
current : 15. Trie : 30.
data : 6, 9, 13, 18, 30. trie new : 7, 29, 30.
EOF: 28. trie new node : 6, 7, 10, 11.
exit : 4. trie print : 19, 29, 30.
e1 : 7. trie print rec: 18, 19.
f : 28. trie visit : 17, 28, 30.
file : 28. trie visit rec : 16, 17.
fopen : 28. TrieNode: 6, 8, 9, 10, 11, 12, 13, 14, 15,
format : 28. 16, 18, 30.
fprintf : 4. TrieNode : 30.
free : 7. w: 23, 27, 28.
fscanf : 28. word : 16, 17, 18, 19, 28, 30.
get or insert child for char : 14, 16.
i: 8, 9, 13, 18, 23, 27.
in children26 : 12, 15.
in children8 : 9, 15.
insert : 11, 14, 15, 16, 17, 30.
is punctuation : 24, 27.
is quote : 25, 27.
is terminal : 6, 14, 15, 18, 30.
is word : 23, 28.
last : 27.
len : 16, 23, 27.
main : 29.
malloc : 3.
MAX_SIZE: 18, 19.
max size : 18.
MAX_WSIZE: 26, 28.
modified : 14, 15, 16, 17.
msg : 4, 30.
n: 9, 13.
n child : 6, 9, 10, 11, 15, 30.
n words : 7, 17, 29, 30.
New : 3, 6, 7.
new child array : 8, 15.
NewV : 3, 8.
node : 11, 14, 15, 18.
nodes : 8, 13.
TRIECODE NAMES OF THE SECTIONS 11
h check cases of child list 15 i Used in section 14.
h main functions 23, 24, 25, 26, 27, 28, 29 i Used in section 20.
h main headers 21 i Used in section 20.
h main.cpp 20 i
h trie functions 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19 i Used in section 2.
h trie includes 3 i Used in section 2.
h trie.cpp 2 i
h trie.h 30 i
TRIECODE
Section Page
Trie Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1
String Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 7
Header File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 9
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 10