Here's a follow-up to an article about enhancing a fundamental storage algorithm for faster search.
System designers can choose from several kinds of algorithms to store, search, and retrieve data. Least cost to system performance in time is the most important factor, with memory use usually relegated to second place. All algorithms access data nodes with keys, but usually data structures are constructed with a single key in mind. If the data has fields that need to be accessed that are not related to the key, inefficiencies result. This implementation is meant to address these issues by setting up data structures that are fully multikey. The same basic search algorithm is used, but the data-node access structures are separate for each key.
In a March 1997 article (“Highly Dynamic Radix Fast Search Algorithm with Easy Sort/Deletion”), I described how to implement a single-key fast search using PATRICIA (Practical Algorithm to Retrieve Information Coded in Alphanumeric), which featured easy sort and deletion.1 PATRICIA, invented by Donald Morrison in 1968, is a radix trie with one child data node per parent branch node, each data node containing a single key string.2
This article builds on that previous one, available online . The fundamental storage algorithm is the same, but I've enhanced the specific implementation in several ways. Starting with the same code base, the enhancements accomplish the following:
- Do a complete multikey implementation, allowing a data node to be accessed, using PATRICIA, by any of N keys of arbitrary length.
- Enhance the print routine to display the branch-node path that each key traverses to reach the common data node.
- Write an entirely new routine to verify the integrity of the data structure at any point in its lifetime.
- Set up a new encapsulating data structure type that holds all the state for a single multikey PATRICIA data structure and could be repeatedly used in one program to declare multiple different PATRICIA search tries.
- Improve both in-code and external documentation.
- Improve the test suite provided with the code (I discuss this test suite in detail later). First, make most of the tests independent of the number of keys, their length, and the maximum total number of trie nodes. Second, use the enhanced print and verify routines to display trie state and verify its correctness. And lastly, give some idea of expected results and success or failure at getting those results.
- Provide implementations in both C and C++. Provide both static and dynamic libraries. Make sure the code runs on both little and big endian machines.
- Standardize on the GNU compiler suite, because it's the most common and not vendor specific.
The most important enhancement over my previous implementation of the routines is the algorithmic changes needed to support an arbitrary multiple number (N ) of keys to find a given data node, all using PATRICIA radix fast search. I believe this is a truly unique feature. To quote Robert Sedgewick, “PATRICIA manages to identify the bits which distinguish the search keys and build them into a data structure (with no surplus nodes) that quickly leads from any search key to the only one key in the data structure that could be equal.”3
As explained later, the characteristic of PATRICIA that enables this multikey implementation is the allocation of a new branch node for every new data node inserted. Thus, with multikey, there are now N branch nodes for every new data node. The characteristic that makes it attractive for embedded applications is the (N branch node/1 data node) structure extended to M data nodes. The entire structure is of fixed size. If all M data nodes are allocated, then all branch and data nodes are used exactly. No more or fewer branch nodes are required, depending on any of the N key structures, to lead to the data node. Storage predictability is over M nodes, and access speed is over N keys. The disadvantages are that a fixed size must be allocated and not exceeded.
It's important to understand certain basic features of PATRICIA. For example, radix searching proceeds by examining the individual bits of keys from the most significant to least significant. As searching proceeds, a series of branch nodes are traversed, each of which has a bit-location index that describes the bit in the key that should be examined, and branching left or right, is determined by the value of that bit in the key that you're looking for, and culminating at a leaf data node that will either have the required key or not. Thus, all searches are through a series of branch nodes to a final leaf data node.
The feature of PATRICIA that makes it most useful is that all insertions and deletions for a given key are in branch-/data-node pairs. Other radix search techniques may require multiple branch nodes to be inserted when inserting a new key into the trie. Thus, for PATRICIA, one can preallocate storage for any number of insertions by allocating branch-/data-node pairs and know that they'll be used up in pairs as the trie is populated. This brings us to the multikey aspect of this implementation.
This multibranch node per data node is the difference between the ordinary PATRICIA trie and this implementation. However, it's not the only important difference. Ordinary PATRICIA uses only one data-structure type, making no distinction between branch and data nodes, and storing the data in each branch node. The structure results in upward directed pointers that point to a node that's the only node that can hold the data for a given key in the trie structure. A series of branches down the trie, through nodes that hold data for other keys, will end at a node that points to the node that will hold the data, and the last data node will always be higher up in the trie than the last branching node. This structure has a significant advantage in that there's only one data-structure type.
Pointers to this type are all that's needed. In the multikey case, this won't work. Several branch nodes must point to the same data, and in classical PATRICIA, the location in the trie would be different for each key. Thus, I needed a second node type, data nodes(DNODE), which have multiple PATRICIA branch node(BNODE) tries pointing to it. This complicates the code somewhat, as there are casts occasionally from one type to another, BNODE to DNODE, or the reverse. And searching a series of BNODE to BNODE pointers will always terminate pointing to a DNODE. Because the pointer can only be declared as one type, the last node in the sequence (the data node) must have its data-node ID field stored in the same place as the branch-node ID for node recognition to work. Thus, don't relocate the node ID field from its first position in both node typedefs.
Your first decision is what to set the three structure-definable parameters to in
rdx_pat_search.h . (All files available at www.embedded.com/code/2007code.htm.) These set the number of keys (
NUM_KEYS ), the length of each key (
NUM_KEY_BYTES ), and the number of data nodes (
MAX_NUM_RDX_NODES ). The test code uses three 20-byte keys and eight data nodes. This small test is sufficient to verify the operation of all library functions without generating volumes of output.
Next you'll want to specify the details of the data structure to be stored in the
APP_DATA type in
rdx_pat_data.h (available online). Then compile by executing
rdx_pat_mk (available online). You're now ready to use the library by defining storage nodes of PNODE type. Each PNODE-type variable will have storage space for
MAX_NUM_RDX_NODES data nodes with
NUM_KEYS keys of
NUM_KEY_BYTES length. The PNODE-defined variables are then used as arguments to the library routines. The easiest way to learn how to use the library is to examine the test code in
rdx_pat_test.c (available online). All eight routines are used in several different circumstances.
The PNODE type is the basic repository of the multikey PATRICIA structures and the user data. Its structure is worth understanding in some detail. In
rdx_pat_search.h , the unsigned int
tot_nodes holds the count of allocated nodes, not including the always-allocated root node. BNODE
*head[NUM_KEYS]; is the array of pointers to the heads of the PATRICIA tries with one trie for each key.
DNODE *node_ptrs[MAX_NUM_RDX_NODES+1]; holds the pointers to each data node that are set by the
rdx_pat_sort() routine. The array is initialized to 0xf0 on each call to
rdx_pat_sort() . On return from
rdx_pat_sort() , it'll have the first
tot_nodes+1 elements set to pointers to the allocated data nodes plus the root node. The last elements should still be 0xf0. The
node_ptrs_cnt variable holds the number set pointers, which should equal
Now we come to the heart of the PATRICIA trie structure: branch and data nodes. For each key, and there are
NUM_KEYS of them, there is a trie. The
BNODE bnodes[MAX_NUM_RDX_NODES+1][NUM_KEYS]; declaration defines
NUM_KEYS branch nodes for each user-data node plus one for the root node. These will form the
NUM_KEYS PATRICIA tries that lead to the data nodes. As the PATRICIA trie grows with the allocation of more data nodes, each new data node will result in
NUM_KEYS branch nodes being inserted at some point in the
NUM_KEYS PATRICIA tries depending on the nature, or the individual bits of the different keys. When the full
MAX_NUM_RDX_NODES+1 nodes are allocated, there would be
(MAX_NUM_RDX_NODES+1)*(NUM_KEYS) branch nodes linked in to
NUM_KEYS PATRICIA tries. The
BNODE *bfree_head[NUM_KEYS]; pointers are to the head of the free list of branch nodes for each key. The
DNODE dnodes[MAX_NUM_RDX_NODES+1]; declaration defines the data nodes with the keys and application data in them.
DNODE *dfree_head; defines the free list of the data nodes.
The BNODE type starts with the unsigned int
id; , which should always be 0, and distinguishes a branch node from a data node whose ID should always be 1. This field lets search code know when the end of a search occurs at a data node. This field must always come first (in the data node also) so the inspection of this field with either node-type cast will get the correct value. The unsigned int
br; left-/right-branch indicator and
void *p; parent-node pointer are set upon insertion and used by deletion for repairing the pointers of branch nodes adjacent to the deleted branch node.
The unsigned int
nsn; branch-node sequence numbers are new with this multikey implementation and are used by the
rdx_pat_verify() routine to help verify that node data isn't corrupt, and by
rdx_pat_print() to give the user information about sequential node locations in the PNODE storage structure. The unsigned int
b; bit number is the bit, starting with 0 on the least significant bit at the right, that defines what bit in the key distinguishes whether a left or right branch is taken down the radix PATRICIA trie. These bit numbers, their meaning and manipulation, are the defining feature of PATRICIA tries. The
void *l, *r; left and right pointers are the links down the trie to the next branch node or, in the terminal case, to the data node.
The DNODE type starts with the unsigned int
id; and should be 1. The unsigned int
br[NUM_KEYS]; serves the same purpose as in the branch nodes, indicating a left(1) or right(0) branch parent, but has
NUM_KEYS possibilities because there are
NUM_KEYS parents, one for each key. The unsigned int
nsn; data-node sequence numbers are used for the
rdx_pat_verify() routine to help verify that node data isn't corrupt, and by
rdx_pat_print() to provide information about sequential node location in the PNODE storage structure.
void *nnfp; next-node free pointer is used to build the data-node free list. The unsigned int
alloc; boolean indicates whether or not the data is allocated(1) or free(0), and is used by both
rdx_pat_verify() , the former to provide user information and the latter to help verify the data structure. Finally, the unsigned char
key[NUM_KEYS][NUM_KEY_BYTES+1]; array holds the
NUM_KEYS keys of
NUM_KEY_BYTES . The extra byte is for comparison with the root-node keys, which are all 0xff of one extra byte, the impossible key. Lastly,
APP_DATA data; holds the application data defined in
General API issues
All routines parameter lists begin with PNODE
*pnodep , a pointer to the basic data structure. Each declaration of PNODE type will result in a different trie of
MAX_NUM_RDX_NODES user-data nodes of
NUM_KEYS keys with
NUM_KEY_BYTES bytes per key. Begin all calls with the form:
PNODE data;rdx_pat_xxx(&data, ...
For search, insert, delete, and print, the second argument will be unsigned char
key[NUM_KEYS][NUM_KEY_BYTES] , which holds the
NUM_KEYS keys of
NUM_KEY_BYTES bytes length. Internally, the keys are stored with one extra high-order byte of 0 for comparison with the root node 0xff key extra high-order byte. If each key is unique within its key
index(0 - NUM_KEYS-1) , then an insertion will succeed. All keys must be of the same length. For actual shorter keys, use only the low-order bytes of the key space with high-order bytes set to 0.
rdx_pat_verify() routine takes an enum argument of
VERIFY_MODE vm . The
ERR_CODE value will cause upon error the return of an integer error code only. The
ERR_CODE_PRINT value will cause upon error the return of the same integer error code, print an error message, and print the free branch-/data-node addresses, the allocated branch/data nodes addresses, and the key values.The print and verify routines take an argument of type
FILE *fp to print to. This can just as easily be a file as a print queue.Routines
rdx_pat_initialize() –After any PNODE definition, you'll want to do initialization. Failure to do this will result in unpredictable results. Every library routine depends on node variables being in a consistent state. Start by setting the entire data structure to 0xf0. This improves the chances of finding problems if something goes wrong. When the print or verify routines are used, this lets the user see or the verify routine to run into 0xf0 where it shouldn't be. Also note that the full user-data structure of each node is set to 0xf0. This might alert the user to any uninitialized portions of data. Typical use:
rdx_pat_search() –The search routine uses (in sequence) each of the
NUM_KEYS PATRICIA tries constructed by the insert routine. Each trie is traversed until either a data node is found with that key or not. If all
NUM_KEYS keys are found and in the same data node, then the search was successful. Success returns a pointer to the found DNODE. Failure returns
NULL . Typical usage is:
PNODE data;unsigned char keys[NUM_KEYS][NUM_KEY_BYTES];DNODE *dnp;dnp = rdx_pat_search(&data, keys);
rdx_pat_insert() –Insertion is the most complex of the operations. Each key must be unique within its own key
index(0-NUM_KEYS) . This means that all of the keys that will define the new data node must be searched for, and if found, will stop the insertion attempt with a return code of 1 and a pointer to the existing key node passed back as the third argument. If no key is found in the trie, a node is allocated from the free queue. If no free nodes are available, the return code is set to 2 and a
NULL pointer is assigned to the return third argument.
PNODE data;unsigned char keys[NUM_KEYS][NUM_KEY_BYTES];DNODE *dnp;int return_code;return_code = rdx_pat_insert (&data, keys, &dnp);
rdx_pat_delete() –Deletion proceeds by searching for each of the
NUM_KEYS keys passed in. If any of the keys aren't found, the deletion fails and
NULL is returned. If all
NUM_KEYS keys are found, deletion continues by removing from the trie structure the data node found and the
NUM_KEYS branch nodes that are the data node's
PNODE data;unsigned char keys[NUM_KEYS][NUM_KEY_BYTES];DNODE *dnp;dnp = rdx_pat_delete(&data, keys);
rdx_pat_sort() –Sorting this structure comes simply by recursively traversing the trie and storing pointers to the nodes in order. The third-argument integer specifies the key
index(0-NUM_KEYS) that will be sorted over. If it's outside
0-NUM_KEYS-1 , return -1. If the trie is empty, return 0. Typical usage:
PNODE data;DNODE **nodes;unsigned int k;int return_code;return_code = dx_pat_sort(&data, &nodes, k);
rdx_pat_nodes() simply returns the number of user allocated nodes in the passed in PNODE. Typical usage:
PNODE data;int n;n = rdx_pat_nodes(&data);
rdx_pat_print() routine has two modes, distinguished by the value of the passed data in second argument. If the argument is
NULL , the
NUM_KEYS branch nodes and the data nodes for each of the total allocated nodes is printed. No attempt is made to associate branch nodes that lead via PATRICIA trie hierarchy to a given data node. All nodes are printed. The
NUM_KEYS branch nodes that are printed with a given data node may be connected in an unspecified way with any data node in the trie. This mode is useful for cases where only a few data nodes are allocated and printing the entire trie will show vis inspection all the trie connections. The second mode is specified when the argument is not
NULL and holds a key argument with the full
NUM_KEYS keys specified. In this case, for each
NUM_KEYS key, the full path from the root node to the data node through all the intervening branch nodes is printed. This gives a means to see how each branch node is connected to the next, leading to the data node for all of the keys. Typical usage:
PNODE data;unsigned char keys[NUM_KEYS][NUM_KEY_BYTES];FILE *fp;int return_code;return_code = rdx_pat_print(&data, NULL, fp);return_code = rdx_pat_print(&data, keys, fp);
rdx_pat_verify() is by far the largest routine of the library. It provides a way to verify the integrity of the multikey PATRICIA trie structure. For example, an embedded systems developer might suspect that his trie has been corrupted by some hardware fault or maybe a neutrino or an erroneous memory access, and wants to know if the trie is in a consistent state, available for access, and that at least the manipulation of the trie via any key is available. The application data's integrity isn't guaranteed however, because
rdx_pat_verify() has nothing to do with
rdx_pat_verify() makes 25 types of error checking. Any error will result in a non-zero integer return code and termination of the routine. There are two modes of operation based on the value of the enum second argument: if
ERR_CODE , only an integer code is returned; if
ERR_CODE_PRINT , a message is printed to the third argument file pointer and also a listing of all of the free and allocated branch-/data-node addresses, node indexes, and data-node keys. Typical usage:
PNODE data;FILE *fp;int return_code;return_code = rdx_pat_verify(&data, ERR_CODE, fp);return_code = rdx_pat_verify(&data, ERR_CODE_PRINT, fp);
rdx_pat_test.c (available online) to test the library routines and provide use examples. All the tests involve two PNODE PATRICIA tries and an associated array for each to hold keys. These are:
PNODE rdx_data1;unsigned char rdx_data1_keys[MAX_NUM_RDX_NODES] [NUM_KEYS][NUM_KEY_BYTES];PNODE rdx_data2;unsigned char rdx_data2_keys[MAX_NUM_RDX_NODES+1] [NUM_KEYS][NUM_KEY_BYTES];
The constants are
NUM_KEY_BYTES=20 defined in
rdx_pat_search.h . These can be adjusted to any non-zero integers. Actually, (1,1,1) works! However, a trie of one node with one key of one byte isn't very useful. Real tries will require more nodes, and the test code should all work correctly for any positive integer values. However, keep in mind that the printout will grow unwieldy. For large tries, you may want to disable all but one or a few tests to check specific results.
As previously noted, the storage allocated for all keys is the same. The question arises as to how to position keys that are shorter than the longest key. As long as all other non-key bytes are zero, it doesn't matter. However, from a performance perspective, left justifying is best.
Gbit() will find the bits that distinguish the key from all others more quickly.
I've run some tests with the valgrind profiler and found that most of the time was spent in the
gbit() routine. A valgrind memcheck also passes.
I also provide a test routine for the
gbit() static internal routine. This routine is critical to the correct operation of the library.
I've run tests on my Linux box with
MAX_NUM_RDX_NODES=100000 , and
NUM_KEY_BYTES=64 successfully. This resulted, with the suite of tests provided, in a
rdx_pat_test.results file size of 1.4 Gbytes! (File available online). The same test with
MAX_NUM_RDX_NODES=1000000 fails with a segmentation fault. Remember that this was done on a Linux machine. Other RTOSes may differ.
Richard Hogaboom has an MS in physics and has published in The C Users Journal, The Perl Journal, and Embedded Systems Programming. Currently, he's working on Group 65: Advanced Networks and Applications, at MIT Lincoln Laboratory, through Solidus Technical Solutions Inc. (www.solidus-ts.com). He can be reached at .
1. Hogaboom, Dick.”Highly Dynamic Radix Fast Search Algorithm with Easy Sort/Deletion,” Embedded Systems Programming , March 1997, p72. Available at www.embedded.com/showArticle.jhtml?articleID=199500801.
2. Morrison, Donald R. “PATRICIA–Practical Algorithm To Retrieve Information Coded in Alphanumeric,” Journal of the Association for Computing Machinery , Vol. 15, No. 4, Oct. 1968, pp. 514-534.
3. Sedgewick, Robert. Algorithms in C . Addison-Wesley, 1998, Patricia Tries, pp. 623-631.
Knuth, Donald. The Art of Computer Programming: Sorting and Searching. Addison-Wesley, 1973, pp 490-499.