Work in progress

Common use-case

Creation

The common use-case is based on a co-occurrence analysis of named entities at the sentence level.

First step

In the first step, all documents of a corpus are decomposed into sentences. εos generates a message digest over each lowercased sentence. The message digest together with additional metadata information acts as the key in a dictionary; such additional metadata may be, for example, the year in which the document was written. The value of a dictionary entry is the sentence and the metadata of the original document. If a key occurs more than once in the dictionary, εos combines the metadata of the values.

Result: duplicate sentences are removed.
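The minimal Java sketch below illustrates the idea of the digest-based dictionary. The class, field and metadata names are invented for illustration and are not part of the εos API.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; SentenceDedupSketch and its fields are not εos classes.
public class SentenceDedupSketch {

    // value: the sentence plus the merged metadata of all documents it came from
    record Value(String sentence, Set<String> creators) {}

    public static void main(String[] args) throws Exception {
        // (sentence, year, creator) triples as they might leave the decomposition step
        String[][] decomposed = {
                {"The quick brown fox.", "1999", "Smith"},
                {"the QUICK brown fox.", "1999", "Jones"},   // same sentence and year, other creator
                {"A different sentence.", "2003", "Smith"}};

        Map<String, Value> dictionary = new LinkedHashMap<>();
        MessageDigest md = MessageDigest.getInstance("MD5");

        for (String[] d : decomposed) {
            // key: message digest over the lowercased sentence plus the selected metadata (here: the year)
            byte[] digest = md.digest(d[0].toLowerCase().getBytes(StandardCharsets.UTF_8));
            String key = new BigInteger(1, digest).toString(16) + "|" + d[1];

            // duplicate key: keep one sentence and merge the remaining metadata
            dictionary.computeIfAbsent(key, k -> new Value(d[0], new TreeSet<>()))
                      .creators().add(d[2]);
        }
        dictionary.values().forEach(v ->
                System.out.println(v.sentence() + "  creators=" + v.creators()));
    }
}

Running the sketch prints two entries: the duplicated sentence appears once with the creators Jones and Smith merged, the second sentence appears unchanged.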

Example usage
hadoop jar net.sf.eos-toolkit.core-<VERSION>-uberjar-executable.jar \
           net.sf.eos.hadoop.mapred.decompose.SentenceMapReduceDriver \
       -D net.sf.eos.hadoop.mapred.AbstractKeyGenerator.impl=net.sf.eos.hadoop.mapred.decompose.TextMetaKeyGenerator \
       -D net.sf.eos.hadoop.mapred.sentencer.TextMetadataKeyGenerator.metaKey=EosDocument/date \
       --source <SOURCE FOLDER> \
       --dest <DESTINATION FOLDER>
VERSION
The version of the εos-toolkit core implementation.
SOURCE FOLDER
The source folder containing EosDocuments to decompose into sentences.
DESTINATION FOLDER
The destination folder to store the decomposed EosDocuments.

Second step

The second step is the co-occurrence analysis. Named entities are recognized in the sentences and replaced. In the common use-case, εos identifies named entities through a simple longest-match lookup and replaces each named entity in the sentence with a concept identifier (ID). For each replaced named entity, εos puts the ID as a key into a map; the value is the sentence containing the replacement IDs. For each equal key, εos concatenates all sentences of that ID together with their metadata and removes the key's ID from the sentences. All other IDs remain in the sentences.

Result: a new document for each recognized ID.
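The following Java sketch illustrates the longest-match replacement and the grouping by concept ID. The dictionary entries, class and method names are invented for illustration and are not the εos implementation, which performs the lookup against a trie.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; the real toolkit looks entities up in a trie.
public class CooccurrenceSketch {

    // surface form -> concept ID
    static final Map<String, String> DICTIONARY = Map.of(
            "new york city", "ID_1",
            "new york", "ID_2",
            "albert einstein", "ID_3");

    public static void main(String[] args) {
        List<String> sentences = List.of(
                "Albert Einstein visited New York City .",
                "New York is on the east coast .");

        // concept ID -> all sentences in which the ID was recognized
        Map<String, List<String>> byId = new HashMap<>();

        for (String sentence : sentences) {
            String[] tokens = sentence.toLowerCase().split("\\s+");
            List<String> replaced = new ArrayList<>();
            Set<String> idsInSentence = new LinkedHashSet<>();

            for (int i = 0; i < tokens.length; ) {
                String id = null;
                int end = i;
                // longest-match lookup: try the longest token span first
                for (int j = tokens.length; j > i && id == null; j--) {
                    id = DICTIONARY.get(String.join(" ", Arrays.copyOfRange(tokens, i, j)));
                    end = j;
                }
                if (id != null) {                 // replace the named entity by its concept ID
                    replaced.add(id);
                    idsInSentence.add(id);
                    i = end;
                } else {
                    replaced.add(tokens[i]);
                    i++;
                }
            }
            String replacedSentence = String.join(" ", replaced);
            for (String id : idsInSentence) {
                // one entry per recognized ID; the key ID itself is removed,
                // all other IDs stay in the sentence
                byId.computeIfAbsent(id, k -> new ArrayList<>())
                    .add(replacedSentence.replace(id, "").replaceAll("\\s+", " ").trim());
            }
        }
        // each key now yields one new document: the concatenation of its sentences
        byId.forEach((id, s) -> System.out.println(id + " -> " + String.join(" ", s)));
    }
}

In the Hadoop job, the same grouping corresponds to emitting the concept ID as map output key, so that all sentences of one ID arrive at the same reducer and can be concatenated into one new EosDocument.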

Example usage
hadoop jar net.sf.eos-toolkit.core-<VERSION>-uberjar-executable.jar \
           net.sf.eos.hadoop.mapred.cooccurrence.DictionaryBasedEntityRecognizerMapReduceDriver \
       -Dnet.sf.eos.hadoop.mapred.cooccurrence.DictionaryBasedEntityRecognizerReducer.metaKeys=EosDocument/creator \
       --trie <PATH TO TRIE> \
       --source <SOURCE FOLDER> \
       --dest <DESTINATION FOLDER>
VERSION
The version of the εos-toolkit core implementation.
SOURCE FOLDER
The source folder containing the EosDocuments in which to recognize named entities.
DESTINATION FOLDER
The destination folder to store the EosDocuments with recognized named entities.
PATH TO TRIE
The path to the trie data used for dictionary-based named entity recognition.

Third step

The third and last step creates a Lucene-based full-text index over all concatenated sentences.

Result: a Lucene full-text index.
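For illustration, a minimal stand-alone sketch of such an index written with the plain Lucene API is shown below. The field names are assumptions, and the actual εos job writes the index through Hadoop rather than to a local directory.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Illustrative sketch only: one Lucene document per concept ID; field names are assumptions.
public class IndexSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/eos-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            doc.add(new StringField("id", "ID_1", Field.Store.YES));                   // the recognized concept ID
            doc.add(new TextField("content", "ID_3 visited . is on the east coast .",
                                  Field.Store.YES));                                   // concatenated sentences
            doc.add(new StringField("EosDocument/date", "1999", Field.Store.YES));      // merged metadata
            writer.addDocument(doc);
        }
    }
}

Each document carries the concept ID, the concatenated sentences as searchable content, and the merged metadata, so co-occurring IDs can later be retrieved with ordinary full-text queries.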

Example usage
hadoop jar net.sf.eos-toolkit.core-<VERSION>-uberjar-executable.jar \
           net.sf.eos.hadoop.mapred.index.IndexMapReduceDriver \
       --source <SOURCE FOLDER> \
       --dest <DESTINATION FOLDER>
VERSION
The version of the εos-toolkit core implementation.
SOURCE FOLDER
The source folder containing the EosDocuments to index.
DESTINATION FOLDER
The destination folder to store the Lucene index.

After indexing, optimize the index by calling

hadoop jar net.sf.eos-toolkit.core-<VERSION>-uberjar-executable.jar \
           net.sf.eos.hadoop.mapred.index.IndexMerger \
       -workingdir <WORKING FOLDER> \
       <DESTINATION FOLDER> \
       <SOURCE FOLDER>
VERSION
The version of the εos-toolkit core implementation.
WORKING FOLDER
Optional local working folder.
SOURCE FOLDER
The source folder containing the Lucene index to optimize, i.e. the destination folder of the preceding indexing command.
DESTINATION FOLDER
The destination folder to store the optimized Lucene index in the Hadoop filesystem.
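Conceptually, merging and optimizing corresponds to the following local Lucene sketch; the real IndexMerger operates on the Hadoop filesystem, and the paths below are placeholders.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Illustrative sketch only: merges partial indexes and force-merges ("optimizes") the result.
public class MergeSketch {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/merged-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addIndexes(
                    FSDirectory.open(Paths.get("/tmp/eos-index/part-00000")),
                    FSDirectory.open(Paths.get("/tmp/eos-index/part-00001")));
            writer.forceMerge(1);   // collapse everything into a single segment
        }
    }
}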

Lookup

TODO