net.sf.eos.sentence
Class Sentencer

java.lang.Object
  extended by net.sf.eos.config.Configured
      extended by net.sf.eos.sentence.Sentencer
All Implemented Interfaces:
Configurable
Direct Known Subclasses:
DefaultSentencer

public abstract class Sentencer
extends Configured

Implementations fragment an EosDocument that contains more than one sentence into multiple documents, each of which may contain only a single sentence. Each sentence is also represented by a hash code, which makes it possible to remove duplicate sentences from a corpus.

Author:
Sascha Kohlmann

Field Summary
static String DEFAULT_MESSAGE_DIGEST
          The default message digest algorithm.
static String MESSAGE_DIGEST_CONFIG_NAME
          The name of the algorithm of the message digest.
static String SENTENCER_IMPL_CONFIG_NAME
          The configuration key name for the classname of the implementation.
 
Constructor Summary
protected Sentencer()
          Creates a new instance.
 
Method Summary
protected  MessageDigest createDigester()
          Returns the message digest implementation.
static Sentencer newInstance(Configuration config)
          Creates a new instance of the implementation.
abstract  Map<String,EosDocument> toSentenceDocuments(EosDocument doc, SentenceTokenizer sentencer, ResettableTokenizer tokenizer, TextBuilder builder)
          Fragments a document into documents of sentences.
 
Methods inherited from class net.sf.eos.config.Configured
configure, getConfiguration
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_MESSAGE_DIGEST

public static final String DEFAULT_MESSAGE_DIGEST
The default message digest algorithm.

See Also:
Constant Field Values

MESSAGE_DIGEST_CONFIG_NAME

@ConfigurationKey(type=CLASSNAME,
                  defaultValue="md5",
                  description="The message digest.")
public static final String MESSAGE_DIGEST_CONFIG_NAME
The name of the algorithm of the message digest.

See Also:
Constant Field Values

SENTENCER_IMPL_CONFIG_NAME

@ConfigurationKey(type=CLASSNAME,
                  description="Configuration key of the sentencer.")
public static final String SENTENCER_IMPL_CONFIG_NAME
The configuration key name for the classname of the implementation.

See Also:
newInstance(Configuration), Constant Field Values
Constructor Detail

Sentencer

protected Sentencer()
Creates a new instance.

Method Detail

newInstance

@FactoryMethod(key="net.sf.eos.sentence.Sentencer.impl",
               implementation=DefaultSentencer.class)
public static final Sentencer newInstance(Configuration config)
                                   throws EosException
Creates a new instance of the implementation. If the Configuration contains the key SENTENCER_IMPL_CONFIG_NAME, a new instance of the class named by its value is created. If no value is set, a DefaultSentencer is created.

Parameters:
config - the configuration
Returns:
a new instance
Throws:
EosException - if it is not possible to instantiate an instance
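
The lookup described above can be illustrated with a minimal, self-contained sketch. It does not use the EOS classes: Splitter, DefaultSplitter and CustomSplitter are hypothetical stand-ins for Sentencer and its implementations, a plain Map plays the role of Configuration, and IllegalStateException stands in for EosException. Only the key string is taken from the @FactoryMethod annotation.

```java
import java.util.Map;

/** Hypothetical sketch of a configuration-driven factory; not the EOS implementation. */
final class SentencerFactorySketch {

    /** Key string as shown in the @FactoryMethod annotation. */
    static final String IMPL_KEY = "net.sf.eos.sentence.Sentencer.impl";

    /** Hypothetical base type playing the role of Sentencer. */
    abstract static class Splitter {
        abstract String name();
    }

    /** Hypothetical default playing the role of DefaultSentencer. */
    static final class DefaultSplitter extends Splitter {
        @Override String name() { return "default"; }
    }

    /** Hypothetical alternative that is selected through the configuration. */
    static final class CustomSplitter extends Splitter {
        @Override String name() { return "custom"; }
    }

    /** Returns the configured implementation, or the default if the key is unset. */
    static Splitter newInstance(Map<String, String> config) {
        String className = config.get(IMPL_KEY);
        if (className == null || className.isEmpty()) {
            return new DefaultSplitter();
        }
        try {
            // Load the class by name and call its no-argument constructor.
            return Class.forName(className)
                        .asSubclass(Splitter.class)
                        .getDeclaredConstructor()
                        .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Stand-in for the EosException documented above.
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }
}
```

Calling newInstance with an empty map yields the default; putting CustomSplitter.class.getName() under IMPL_KEY yields the custom type instead.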

createDigester

protected MessageDigest createDigester()
                                throws EosException
Returns the message digest implementation. If the configuration contains no value for the key MESSAGE_DIGEST_CONFIG_NAME, the default digest (DEFAULT_MESSAGE_DIGEST) is used.

Returns:
the message digest
Throws:
EosException - if it is not possible to create the message digest
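
As a rough illustration of the fallback behaviour, the following sketch resolves a digest algorithm from a plain Map and falls back to a default. It is an assumption-laden sketch, not the EOS code: DIGEST_KEY is a made-up key (the actual value of MESSAGE_DIGEST_CONFIG_NAME is not shown on this page), the default "md5" is taken from the defaultValue of the annotation above, and IllegalStateException stands in for EosException.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

/** Hypothetical sketch of digest creation with a default; not the EOS implementation. */
final class DigesterSketch {

    /** Made-up key name, used only for this illustration. */
    static final String DIGEST_KEY = "example.sentencer.messageDigest";

    /** Default taken from the defaultValue shown in the @ConfigurationKey annotation. */
    static final String DEFAULT_DIGEST = "md5";

    /** Creates a digest for the configured algorithm, or for the default if unset. */
    static MessageDigest createDigester(Map<String, String> config) {
        String algorithm = config.getOrDefault(DIGEST_KEY, DEFAULT_DIGEST);
        try {
            return MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            // Stand-in for the EosException documented above.
            throw new IllegalStateException("Unknown message digest: " + algorithm, e);
        }
    }
}
```

With an empty map the sketch returns an MD5 digest; setting DIGEST_KEY to "SHA-256" selects that algorithm instead.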

toSentenceDocuments

public abstract Map<String,EosDocument> toSentenceDocuments(EosDocument doc,
                                                            SentenceTokenizer sentencer,
                                                            ResettableTokenizer tokenizer,
                                                            TextBuilder builder)
                                                     throws EosException
Fragments a document into documents of sentences. The return value is a map from message digest to sentence document. Each returned document carries all metadata of the original document and may carry additional metadata.

Parameters:
doc - the document to fragment
sentencer - a sentence tokenizer instance
tokenizer - a tokenizer instance to tokenize the result of the sentencer
builder - a builder that supports rebuilding text from the tokenizer's tokens
Returns:
a map of message digest -> document relations
Throws:
EosException - if an error occurs
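
The contract above (digest keys, copied metadata, duplicate removal) can be sketched without the EOS types. Everything in this example is hypothetical: Doc stands in for EosDocument, a naive regular expression replaces the SentenceTokenizer, and the tokenizer and builder parameters are left out entirely. MD5 and hex-encoded keys are assumptions made for the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of sentence fragmentation; not the EOS implementation. */
final class SentenceFragmentSketch {

    /** Hypothetical stand-in for EosDocument: text plus metadata. */
    static final class Doc {
        final String text;
        final Map<String, String> meta;

        Doc(String text, Map<String, String> meta) {
            this.text = text;
            this.meta = new LinkedHashMap<>(meta); // defensive copy
        }
    }

    /** Maps the hex digest of each sentence to a one-sentence document. */
    static Map<String, Doc> toSentenceDocuments(Doc doc) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        Map<String, Doc> result = new LinkedHashMap<>();
        // Naive split after '.', '!' or '?' followed by whitespace.
        for (String part : doc.text.split("(?<=[.!?])\\s+")) {
            String sentence = part.trim();
            if (sentence.isEmpty()) {
                continue;
            }
            byte[] hash = md.digest(sentence.getBytes(StandardCharsets.UTF_8));
            // Identical sentences share a digest, so repeats are dropped here.
            result.putIfAbsent(toHex(hash), new Doc(sentence, doc.meta));
        }
        return result;
    }

    /** Encodes bytes as lowercase hexadecimal. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Because the map is keyed by digest, a text in which the same sentence occurs twice produces only one entry for that sentence, which is the duplicate removal mentioned in the class description.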


Copyright © 2008. All Rights Reserved.