Package net.sf.eos.trie

Contains the base structure for memory-based entity recognition.

Interface Summary
PatriciaTrie.KeyAnalyzer<K> Defines the interface to analyze Trie keys on a bit level.
Trie<K,V> Defines the interface for a prefix tree, an ordered tree data structure.
Trie.Cursor<K,V> An interface used by a Trie.
TrieLoader<K,V> Implementations create new tries.
TrieSource  
TrieSource.TrieEntryListener  
 

Class Summary
AbstractTrieLoader<K,V>  
ByteArrayKeyAnalyzer  
CharSequenceKeyAnalyzer Analyzes CharSequence keys with case sensitivity.
EmptyIterator Provides an unmodifiable empty iterator.
PatriciaTrie<K,V> A PATRICIA Trie.
TrieHandler  
TrieSource.TrieEntry Represents an entry in the Trie.
TrieSource.TrieEntryEvent  
TrieUtils Miscellaneous utilities for Tries.
UnmodifiableIterator<E> A convenience class to aid in developing iterators that cannot be modified.
XmlTrieLoader The builder creates a trie from a simple XML file.
 

Enum Summary
Trie.Cursor.SelectStatus The mode during selection.
 

Package net.sf.eos.trie Description

Contains the base structure for memory-based entity recognition. The trie is based on a PATRICIA implementation from the Limewire project. The implementation is provided under the terms of version 3 of the GNU General Public License (GPL).

The main benefit of a memory-based implementation for entity recognition follows from the cluster structure of the Hadoop system. In such a system it is counterproductive to have a central instance for entity recognition. A central system always becomes the bottleneck when it is under load from a few hundred cluster nodes, each with X running instances. A PATRICIA trie structure consumes less main memory than other implementations.

To work with the trie in a cluster environment, use the service offered by AbstractTrieLoader. The default serialization format is defined in XmlTrieLoader. At this time the trie's key structure is based on CharSequences. This implementation is not as memory-optimized as the byte array implementation. The byte-array-oriented key analyzer may use CharSequences transformed into UTF-8 bytes. This saves memory for Latin-based languages.
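
The memory argument for UTF-8 keys can be illustrated with a minimal sketch. It does not use any class from this package; the class name KeySizeSketch and its helper methods are hypothetical and exist only for this example. It compares the raw payload of a key held as 16-bit char units with the same key encoded as UTF-8 bytes, ignoring object overhead and any JVM-internal string compaction.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of net.sf.eos.trie.
final class KeySizeSketch {

    /** Payload in bytes when the key is held as 16-bit char units. */
    static int utf16Bytes(CharSequence key) {
        return key.length() * 2; // each char is one 16-bit code unit
    }

    /** The same key transformed into UTF-8 bytes, as a byte array key would hold it. */
    static byte[] toUtf8(CharSequence key) {
        return key.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Berlin", "Muenchen" written with u-umlaut, and "Tokyo" written in kanji
        String[] keys = {"Berlin", "M\u00fcnchen", "\u6771\u4eac"};
        for (String key : keys) {
            System.out.printf("%s: %d bytes as chars, %d bytes as UTF-8%n",
                    key, utf16Bytes(key), toUtf8(key).length);
        }
    }
}
```

Under these assumptions, pure ASCII keys shrink to half their size and keys with a few accented letters shrink slightly less, while keys in scripts that need three UTF-8 bytes per character grow. This matches the statement that the saving applies to Latin-based languages.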

For Hadoop, use Hadoop's distributed cache mechanism to make the trie available on the cluster nodes. See net.sf.eos.hadoop for further information.

Since:
0.1.0
Author:
Sascha Kohlmann
See Also:
net.sf.eos.hadoop, net.sf.eos.entity, net.sf.eos.hadoop.mapred.cooccurrence


Copyright © 2008. All Rights Reserved.