Introduction
eos-toolkit core is the base implementation of εos. εos stands for entity oriented search. Eos
is also the name of the goddess of the dawn in Greek mythology (Greek: Ηώς), whose Roman counterpart is Aurora.
The major task of εos is to identify concordance (index) lists of related named entities from a text corpus. To support this task, εos aims to offer a set of tools and concepts that cover the whole processing chain, so that different applications can be built on it. A further goal is to offer an out-of-the-box implementation for a common use case.
Possible applications of entity oriented search in unstructured text, with or without metadata, are:
- Enrich news search
  - Based on the concordance in a timeline oriented search, εos should offer nearby named entities. E.g. in April 2008, searching for "Hillary Clinton" may offer you the concordance of the named entities "Barack Obama" and "John McCain".
- Explore timeline based named entity occurrence
  - This may be a use case for researchers in the biomedical domain: explore the named entity "Dopamine" in a timeline based context with "Parkinson's disease". What is an upcoming named entity in your research domain?
- Improve lexicon viewing
  - Offer the user of an encyclopedia entries that are in the context of the observed entry.
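
The news-search application can be illustrated with a small sketch. It is not taken from the eos-toolkit code base: the class `ConcordanceSketch`, the type `TaggedDocument` and the method `concordance` are hypothetical names, and the corpus is made-up toy data. The sketch assumes that named entities have already been extracted per document, and it simply counts which other entities co-occur with the query entity in documents published in a given month.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: these names are hypothetical and not part of eos-toolkit.
final class ConcordanceSketch {

    /** A document reduced to its publication date and the named entities found in it. */
    static final class TaggedDocument {
        final LocalDate published;
        final Set<String> entities;

        TaggedDocument(LocalDate published, Set<String> entities) {
            this.published = published;
            this.entities = entities;
        }
    }

    /**
     * Counts how often each other entity co-occurs with the query entity in
     * documents published in the given month. The result is ordered by
     * descending count, with ties broken alphabetically.
     */
    static Map<String, Integer> concordance(List<TaggedDocument> corpus, String query, YearMonth month) {
        Map<String, Integer> counts = new HashMap<>();
        for (TaggedDocument doc : corpus) {
            boolean inMonth = YearMonth.from(doc.published).equals(month);
            if (!inMonth || !doc.entities.contains(query)) continue;
            for (String entity : doc.entities) {
                if (!entity.equals(query)) counts.merge(entity, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : ranked) result.put(e.getKey(), e.getValue());
        return result;
    }

    public static void main(String[] args) {
        // Toy corpus, invented for illustration.
        List<TaggedDocument> corpus = List.of(
            new TaggedDocument(LocalDate.of(2008, 4, 3), Set.of("Hillary Clinton", "Barack Obama")),
            new TaggedDocument(LocalDate.of(2008, 4, 17), Set.of("Hillary Clinton", "Barack Obama", "John McCain")),
            new TaggedDocument(LocalDate.of(2008, 4, 20), Set.of("John McCain")),
            new TaggedDocument(LocalDate.of(2007, 1, 9), Set.of("Hillary Clinton", "Bill Clinton")));
        // prints {Barack Obama=2, John McCain=1}
        System.out.println(concordance(corpus, "Hillary Clinton", YearMonth.of(2008, 4)));
    }
}
```

Plain co-occurrence counting is only one possible ranking. A real deployment would presumably rely on index-side scoring rather than an in-memory loop like this one.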
Based on
εos is based on two major open source projects:
- Lucene
  - Lucene is the backbone of the retrieval side. εos relies heavily on the tf-idf scoring and the full-text retrieval functions of Lucene.
- Hadoop
  - Hadoop is the backbone of the analysis side of εos. Because it takes a long time to create a Lucene index for the retrieval side of εos, Hadoop is a strong option for creating such an index in a time that is acceptable for the online search business.
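
For readers unfamiliar with tf-idf, the following minimal sketch shows the textbook form of the weight, tf(t, d) * ln(N / df(t)). It is an assumption-laden illustration, not Lucene's actual scoring code, which applies its own variant with additional normalization. The class `TfIdfSketch` and the toy corpus are hypothetical.

```java
import java.util.List;

// Illustrative sketch only: textbook tf-idf, not Lucene's internal scoring code.
final class TfIdfSketch {

    /** tf-idf(t, d) = tf(t, d) * ln(N / df(t)); returns 0 when the term is absent. */
    static double tfIdf(String term, List<String> document, List<List<String>> corpus) {
        // tf: number of occurrences of the term in this document.
        long tf = document.stream().filter(term::equals).count();
        // df: number of documents in the corpus that contain the term.
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        // Toy corpus of already tokenized documents, invented for illustration.
        List<List<String>> corpus = List.of(
            List.of("dopamine", "parkinson", "dopamine"),
            List.of("parkinson", "therapy"),
            List.of("parkinson", "diagnosis"));
        // "parkinson" occurs in every document, so its weight is 0.
        System.out.println(tfIdf("parkinson", corpus.get(0), corpus));
        // "dopamine" occurs twice in one of three documents: 2 * ln(3).
        System.out.println(tfIdf("dopamine", corpus.get(0), corpus));
    }
}
```

The point of the weighting is that terms occurring in every document contribute nothing, while terms concentrated in few documents are ranked higher.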
Next Tasks
- Create a use-case web service for Wikipedia based entity oriented search.
- Add contribution code to transform Wikipedia into EosDocuments inside a Hadoop cluster.
- Improve documentation.
- Set up a development environment for better user support (e.g. mailing list, wiki, issue tracker).