The Dictionary Trie, An efficient word and prefix lookup algorithmLoading the words of a dictionary into an in-memory structure for the purpose of lookups occurs quite commonly in computing. That is, if we allow word to mean any sequence of bytes, letters, or other structured data. Indeed such lookup is one of the cornerstones of computing. In our case we have a slightly extended challenge: we need to be able to lookup prefixes as well as entire words. That is, we need to be able to tell if a word exists in the list, but also if any word with a given prefix appears in the list. For the Worgle computer player this needs to be repeated with many inputs in a short time span, thus requiring a very fast algorithm. What doesn't workLet's first explore using the more common indexing approaches to see why they don't, or may not, work as desired. Hash tableA hash table, appealing since complete word lookup is very fast, would not do so well with prefix lookup: you'd be stuck iterating the entire list doing substring matching. A simple adjustment would be to simply add all prefixes for a word to the hash as well, where the value associated with the key would then indicate whether it is a word or a prefix. This seems like it might work, but then consider you'll know have thousands additional short strings in the hash. Unless you have an optimized hashing algorithm you'll end up with many collisions -- greatly hurting the search performance. The resulting hash table will also be extremely large in comparison to the dictionary size. All the words will need to be stored, plus all of their prefixes, plus the overhead of the hash indexes. One can easily forget the overhead of the hash table itself if you deal mainly with larger objects. Words however are very small. Consider that a single entry in the hash table, ignoring collisions, requires one hash value, likely a 32-bit integer, and one pointer to its contents, so another 32-bit pointer. That is 64-bits, or 8 bytes. Well, consider the English language, or just look at this document; how many words here are less than 8 characters long? The vast majority of them. Thus, the indexing overhead in a hash table is likely a lot larger than than the content itself. Binary treeA binary tree, appealing since lookup is rather fast, may also do okay with added prefixes. The tree size would greatly increase with the added prefixes, and the logarithmic lookup time would likely grow a few steps, though not too many. Binary trees of course, like hash tables, use lots of memory. While this isn't really a problem in many scenarios, since its has be transferred for each player of the game it may be a problem with flash. Another potential performance hit in the binary tree lies hidden in the details of how exactly the lookups are done. At each node the search string is compared to the node, an operation that compares several characters of two strings. While this does not increase the big-O computational complexity, it certainly impacts real time -- compared to a tightly coded node stepping algorithm it is quite plausible to spend just as many steps, if not more, comparing characters. What works: TrieThe algorithm I finally arrived at is a version of a trie -- a very specialized form of a tree which just happens to suit our needs ([link=...]Wikipedia has an adequate basic description[/link]). The basic premise of the trie is that each node contains just a single character, not an entire string. This extends the depth of the tree, but reduces the comparison overhead at each node. As the key bonus to this approach is that we don't need to explicitly add prefixes: they are inherent to the way in which the trie is constructed. If any word begins with a given prefix that means the prefix path itself exists in the trie. A further bonus is that the memory use is far reduced since any shared prefixes are stored in memory only once. That is, a trie has an automatic form of prefix compression of which the hashtable and binary trees have no equivalent. [h3]More precisely[/h3] Our trie for the language dictionary is implemented with nodes with a series of buckets. Each bucket corresponds to a particular letter in the alphabet. The bucket itself is simply a pointer to the next node in the trie. Each node also contains an additional flag to indicate it is a word end, which if not set indicates there is no word which ends at this node. The root node of the trie then holds the first letters of all words in the language. Thus 26 bucket for characters A-Z. All words beginning with A are placed in the first bucket. The first bucket then in turn has the same structure beginning with the second letter of the word. To lookup a word or prefix we start at the first letter and the root node. If the character bucket in the node has the first letter then we continue to the second character and linked node. This continues until either our input string is exhausted, or the bucket which holds the next character is empty. If the bucket is empty we have found nothing. If the input string is empty then we check the flag at the current node to see whether a word exists, or just a prefix. [h3]The suffix bonus[/h3] Words in many languages, in particular English, have common suffixes, such as "nation" and "recreation" both share the suffix "ation". It turns out that when storing in a trie, these common suffixes end up having identical node structures. This means there is an excessive amount of redundant node information in the trie. Removing this information is simply a matter of iterating over all the nodes and comparing to all other nodes. This results in a significantly compressed form of the original data. In fact the resulting data is smaller than a gzip compressed form of the input data. If this type of storage is really so good at compressing data why is it not used in general compression libraries? Firstly, since the prefix and suffix nature is rather specific to a dictionary like file, other data files may not have this structure and the trie form would be useless. Secondly, that step of iterating and comparing nodes is essentially N^2 algorithm, which means it can become horribly slow. Thirdly, this algorithm requires random access to all the data, which is only feasible if the data set fits into memory. by edA-qa mort-ora-y on Sunday May 11 2008 @ 12:23:56 |
Computing ↪Technology ⌦Reply |
© edA-qa mort-ora-y 2008