Work in progress
The common use case is based on a co-occurrence analysis of named entities at sentence level.
In a first step, all documents of a corpus are decomposed into sentences. εos generates a message digest over each lower-cased sentence. The message digest, together with additional metadata, acts as the key in a dictionary. The additional metadata may be, for example, the year in which the document was written. The value in the dictionary is the sentence together with the metadata of the original document. If a key occurs more than once, εos combines the metadata of the values.
Result: duplicate sentences are removed.
hadoop jar net.sf.eos-toolkit.core-<VERSION>-uberjar-executable.jar \
    net.sf.eos.hadoop.mapred.decompose.SentenceMapReduceDriver \
    -D net.sf.eos.hadoop.mapred.AbstractKeyGenerator.impl=net.sf.eos.hadoop.mapred.decompose.TextMetaKeyGenerator \
    -D net.sf.eos.hadoop.mapred.sentencer.TextMetadataKeyGenerator.metaKey=EosDocument/date \
    --source <SOURCE FOLDER> \
    --dest <DESTINATION FOLDER>
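The deduplication in this first step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the toolkit's MapReduce implementation: the function names, the use of SHA-256 as the message digest, and the representation of metadata as a simple dictionary of strings are all assumptions made for the example.

```python
import hashlib


def sentence_key(sentence, meta_value):
    """Build the dictionary key: digest of the lower-cased sentence plus one metadata value.

    SHA-256 is an assumption; the text only says "message digest".
    """
    digest = hashlib.sha256(sentence.lower().encode("utf-8")).hexdigest()
    return (digest, meta_value)


def deduplicate(records, meta_key="date"):
    """Collapse sentences that share a key and combine their metadata.

    records: iterable of (sentence, metadata) pairs, where metadata is a
    dict mapping a metadata name to a single string value (hypothetical layout).
    """
    merged = {}
    for sentence, metadata in records:
        key = sentence_key(sentence, metadata.get(meta_key))
        if key not in merged:
            # First occurrence: keep this sentence and start collecting metadata.
            merged[key] = (sentence, {})
        combined = merged[key][1]
        for name, value in metadata.items():
            combined.setdefault(name, set()).add(value)
    # Turn the collected sets into sorted lists for a stable result.
    return [
        (sentence, {name: sorted(values) for name, values in combined.items()})
        for sentence, combined in merged.values()
    ]


records = [
    ("The cat sat.", {"date": "2007", "creator": "alice"}),
    ("the CAT sat.", {"date": "2007", "creator": "bob"}),
    ("The cat sat.", {"date": "2008", "creator": "alice"}),
]
result = deduplicate(records)
```

In this sketch the first two records collapse into one entry because they differ only in case and share the same date, while the third stays separate because its date differs. Making the metadata value part of the key is what keeps identical sentences from different years apart.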
The second step is the co-occurrence analysis. Named entities are recognized in the sentences and replaced. In the common use case, εos identifies named entities through a simple longest-match lookup. εos replaces each named entity in the sentence with a concept identifier (ID). For each replaced named entity, εos puts the ID as a key in a map; the value is the sentence with the named entities replaced by their IDs. For each key, εos concatenates all sentences belonging to that ID, together with their metadata, and removes the key ID from the sentences. All other IDs remain in the sentences.
Result: a new document for each recognized ID.
hadoop jar net.sf.eos-toolkit.core-<VERSION>-uberjar-executable.jar \
    net.sf.eos.hadoop.mapred.cooccurrence.DictionaryBasedEntityRecognizerMapReduceDriver \
    -Dnet.sf.eos.hadoop.mapred.cooccurrence.DictionaryBasedEntityRecognizerReducer.metaKeys=EosDocument/creator \
    --trie <PATH TO TRIE> \
    --source <SOURCE FOLDER> \
    --dest <DESTINATION FOLDER>
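The second step can be illustrated with a small Python sketch. It is not the toolkit's code: the real lookup works on a trie, whereas this example uses a plain dictionary keyed by token tuples, matches case-insensitively, splits on whitespace, and leaves out the metadata handling. The function names, the `max_len` limit, and the IDs `C1` to `C3` are hypothetical.

```python
from collections import defaultdict


def replace_entities(tokens, dictionary, max_len=4):
    """Greedy longest match: at each position, try the longest phrase first."""
    replaced, ids = [], []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(t.lower() for t in tokens[i:i + n])
            if phrase in dictionary:
                entity_id = dictionary[phrase]
                replaced.append(entity_id)
                ids.append(entity_id)
                i += n  # skip the whole matched phrase
                break
        else:
            # No phrase starting here is in the dictionary: keep the token.
            replaced.append(tokens[i])
            i += 1
    return replaced, ids


def group_by_id(sentences, dictionary):
    """Build one list of sentences per recognized ID."""
    docs = defaultdict(list)
    for sentence in sentences:
        replaced, ids = replace_entities(sentence.split(), dictionary)
        for entity_id in set(ids):
            # Remove the key ID itself; all other IDs stay in the sentence.
            docs[entity_id].append(" ".join(t for t in replaced if t != entity_id))
    return dict(docs)


dictionary = {
    ("new", "york"): "C1",
    ("new", "york", "times"): "C2",
    ("berlin",): "C3",
}
sentences = [
    "The New York Times opened an office in Berlin",
    "Berlin is larger than New York",
]
docs = group_by_id(sentences, dictionary)
```

Here "New York Times" resolves to `C2` rather than `C1` because the longer phrase wins, and the entry for `C3` ends up with two sentences, one still containing `C2` and one containing `C1`. Those remaining IDs are what make the co-occurrences searchable later.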
The third and last step creates a Lucene-based fulltext index over all concatenated sentences.
Result: A Lucene fulltext index.
hadoop jar net.sf.eos-toolkit.core-<VERSION>-uberjar-executable.jar \
    net.sf.eos.hadoop.mapred.index.IndexMapReduceDriver \
    --source <SOURCE FOLDER> \
    --dest <DESTINATION FOLDER>
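To see why a fulltext index over the per-ID documents supports co-occurrence queries, consider the toy inverted index below. It is deliberately not Lucene and says nothing about how the toolkit configures its index; the document layout (an ID mapped to a list of sentences), the function names, and the sample IDs are assumptions for illustration only.

```python
from collections import defaultdict


def build_index(documents):
    """Map every lower-cased token to the set of entity IDs whose document contains it.

    documents: dict of entity ID -> list of sentences (hypothetical layout).
    """
    index = defaultdict(set)
    for entity_id, sentences in documents.items():
        for sentence in sentences:
            for token in sentence.lower().split():
                index[token].add(entity_id)
    return index


def search(index, term):
    """Return the IDs of all documents that contain the term."""
    return sorted(index.get(term.lower(), set()))


documents = {
    "C1": ["C3 is larger than"],
    "C3": ["The C2 opened an office in", "is larger than C1"],
}
index = build_index(documents)
co_occurring_with_c1 = search(index, "C1")
documents_with_larger = search(index, "larger")
```

Because the key ID was removed from its own document, searching for an ID returns only the documents of other entities that share a sentence with it. Searching for an ordinary word returns every entity that was mentioned in a sentence containing that word.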
After indexing, optimize the index by calling:
hadoop jar net.sf.eos-toolkit.core-<VERSION>-uberjar-executable.jar \
    net.sf.eos.hadoop.mapred.index.IndexMerger \
    -workingdir <WORKING FOLDER> \
    <DESTINATION FOLDER> \
    <SOURCE FOLDER>
TODO