Tries:Space Required and Alternative Node Structures

Space Required and Alternative Node Structures

The use of branch nodes that have as many child fields as the radix of the digits (or one more than this radix when different keys may have different length) results in a fast search algorithm. However, this node structure is often wasteful of space because many of the child fields are null. A radix r trie for d digit keys requires O(rdn) child fields, where n is the number of elements in the trie. To see this, notice that in a d digit trie with n information nodes, each information node may have at most d ancestors, each of which is a branch node. Therefore, the number of branch nodes is at most dn. (Actually, we cannot have this many branch nodes, because the information nodes have common ancestors like the root node.)

We can reduce the space requirements, at the expense of increased search time, by changing the node structure. For example, each branch node of a trie could be replaced by any of the following:

1. A chain of nodes, each node having the three fields digitV alue, child, next. Node A of Figure 28.1, for example, would be replaced by the chain shown in Fig- ure 28.2.

The space required by a branch node changes from that required for r children/pointer/reference fields to that required for 2p pointer fields and p digit value fields, where p is the number of children fields in the branch node that are not null. Under the assumption that pointer fields and digit value fields are of the same size, a reduction in space is realized when more than two-thirds of the children fields in branch nodes are null. In the worst case, almost all the branch nodes have only 1 field that is not null and the space savings become almost (1 3/r) 100%.

2. A (balanced) binary search tree in which each node has a digit value and a pointer to the subtrie for that digit value. node A of Figure 28.1.

Figure 28.3 shows the binary search tree for Under the assumption that digit values and pointers take the same amount of space, the binary search tree representation requires space for 4p fields per branch node, because each search tree node has fields for a digit value, a subtrie pointer, a left child pointer, and a right child pointer. The binary search tree representation of a branch node saves us space when more than three-fourths of the children fields in branch nodes are null. Note that for large r, the binary search tree is

image

faster to search than the chain described above.

3. A binary trie (i.e., a trie with radix 2). Figure 28.4 shows the binary trie for node A of Figure 28.1. The space required by a branch node represented as a

binary trie is at most (2 ∗ plog2 rl + 1)p.

4. A hash table. When a hash table with a sufficiently small loading density is used, the expected time performance is about the same as when the node structure of Figure 1 is used. Since we expect the fraction of null child fields in a branch node to vary from node to node and also to increase as we go down the trie, maximum space efficiency is obtained by consolidating all of the branch nodes into a single hash table. To accomplish this, each node in the trie is assigned a number, and each parent to child pointer is replaced by a triple of the form (currentN ode, digitV alue, childN ode). The numbering scheme for nodes is cho- sen so as to easily distinguish between branch and information nodes. For exam- ple, if we expect to have at most 100 elements in the trie at any time, the numbers 0 through 99 are reserved for information nodes and the numbers 100 on up are used for branch nodes. The information nodes are themselves represented as an array inf ormation[100]. (An alternative scheme is to represent pointers as tuples of the form (currentN ode, digitV alue, childN ode, childN odeIsBranchN ode), where childN odeIsBranchN ode = true iff the child node is a branch node.) Suppose that the nodes of the trie of Figure 28.1 are assigned numbers as given in Figure 28.5. This number assignment assumes that the trie will have no more than 10 elements.

The pointers in node A are represented by the tuples (10, 2, 11), (10, 5, 0), and (10, 9, 12). The pointers in node E are represented by the tuples (13, 1, 1) and (13, 8, 2).

The pointer triples are stored in a hash table using the first two fields (i.e., the currentN ode and digitV alue) as the key. For this purpose, we may transform the two field key into an integer using the formula currentN ode r + digitV alue, where r is the trie radix, and use the division method to hash the transformed key into a home bucket. The data presently in information node i is stored in inf ormation[i].

To see how all this works, suppose we have set up the trie of Figure 28.1 using the hash table scheme just described. Consider searching for an element with key 278-49-1515. We begin with the knowledge that the root node is assigned the number 10. Since the first digit of the search key is 2, we query our hash table for a pointer triple with key (10, 2). The hash table search is successful and the triple (10, 2, 11) is retrieved. The childN ode component of this triple is 11, and since all information nodes have a number 9 or less, the child node is determined to be a branch node. We make a move to the branch node 11. To move to the next level of the trie, we use the second digit 7 of the search key. For the move, we query the hash table for a pointer with key (11, 7). Once again, the search is successful and the triple (11, 7, 13) is retrieved. The next query to the hash table is for a triple with key (13, 8). This time, we obtain the triple (13, 8, 2). Since childN ode = 2 < 10, we know that the pointer gets us to an information node. So, we compare the search key with the key of the element inf ormation[2]. The keys match, and we have found the element we were looking for.

When searching for an element with key 322-167-8976, the first query is for a triple with key (10, 3). The hash table has no triple with this key, and we conclude that the trie has no element whose key equals the search key.

The space needed for each pointer triple is about the same as that needed for each node in the chain of nodes representation of a trie node. Therefore, if we use a linear open addressed hash table with a loading density of α, the hash table scheme will take approximately (11) 100% more space than required by the chain of nodes scheme. However, when the hash table scheme is used, we can retrieve a pointer in O(1) expected time, whereas the time to retrieve a pointer using the chain of nodes scheme is O(r). When the (balanced) binary search tree or binary trie schemes are used, it takes O(log r) time to retrieve a pointer. For large radixes, the hash table scheme provides significant space saving over the scheme of Figure 28.1 and results in a small constant factor degradation in the expected time required to perform a search.

The hash table scheme actually reduces the expected time to insert elements into a trie, because when the node structure of Figure 28.1 is used, we must spend O(r) time to initialize each new branch node (see the description of the insert operation below). However, when a hash table is used, the insertion time is independent of the trie radix.

To support the removal of elements from a trie represented as a hash table, we must be able to reuse information nodes. This reuse is accomplished by setting up an available space list of information nodes that are currently not in use.

Andersson and Nilsson [1] propose a trie representation in which nodes have a variable degree. Their data structure, called LC-tries (level-compressed tries), is obtained from a binary trie by replacing full subtries of the binary trie by single node whose degree is 2i, where i is the number of levels in the replaced full subtrie. This replacement is done by examining the binary trie from top to bottom (i.e., from root to leaves).

image

Comments

Popular posts from this blog

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Data Structure Visualization:Introduction and Value of Data Structure Rendering

0/1 Knapsack Problem Memory function.