Describes how to invoke the project's converter.
To convert Medline Citation Set documents into EosDocuments, execute:
hadoop jar net.sf.eos-toolkit.contrib.converter-<VERSION>-uberjar-executable.jar \
    -inputreader "org.apache.hadoop.streaming.StreamXmlRecordReader,maxrec=1000000,begin=<MedlineCitation ,end=</MedlineCitation>" \
    -mapper net.sf.eos.contrib.converter.medline.MedlineMapper \
    -reducer net.sf.eos.contrib.converter.medline.MedlineReducer \
    -input <SOURCE FOLDER> \
    -output <DESTINATION FOLDER>
The MedlineCitation DTD requires at least one attribute on the <MedlineCitation> element. The converter relies on this to distinguish the opening character sequence of that element from that of its parent element <MedlineCitationSet>: the begin marker passed to the input reader ends with a whitespace character, which can follow the name MedlineCitation but cannot appear inside the name MedlineCitationSet. Recall from the XML specification that the tag name and the first attribute are separated by a sequence of whitespace characters (S ::= (#x20 | #x9 | #xD | #xA)+). The implementation assumes, based on practice, that the first whitespace character is #x20. If this assumption causes problems in the future, a different approach may be needed to work around it.
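The effect of the trailing #x20 in the begin marker can be illustrated with a small sketch. It assumes that the record reader looks for the begin marker as a literal character sequence. The class name BeginMarkerSketch, the helper firstMatch, and the sample XML are hypothetical and are not part of the converter.

```java
// Illustrative sketch only: models the begin marker as a literal string search.
class BeginMarkerSketch {

    // Begin marker as given in the command above, including the trailing #x20.
    static final String BEGIN_WITH_SPACE = "<MedlineCitation ";

    // The same marker without the trailing whitespace character.
    static final String BEGIN_WITHOUT_SPACE = "<MedlineCitation";

    // Returns the index of the first occurrence of marker in xml, or -1 if absent.
    static int firstMatch(String xml, String marker) {
        return xml.indexOf(marker);
    }

    public static void main(String[] args) {
        // Hypothetical sample input; the attribute values are placeholders.
        String xml = "<MedlineCitationSet>\n"
                   + "<MedlineCitation Owner=\"NLM\" Status=\"MEDLINE\">\n"
                   + "</MedlineCitation>\n"
                   + "</MedlineCitationSet>\n";

        // Without the space, the marker already matches the parent element.
        System.out.println(firstMatch(xml, BEGIN_WITHOUT_SPACE)); // prints 0

        // With the space, the first match is the citation record itself.
        System.out.println(firstMatch(xml, BEGIN_WITH_SPACE));    // prints 21

        // A tab after the tag name would not be found by the #x20 marker.
        String tabbed = "<MedlineCitation\tOwner=\"NLM\">";
        System.out.println(firstMatch(tabbed, BEGIN_WITH_SPACE)); // prints -1
    }
}
```

The last case shows the limitation described above: a document that uses #x9, #xD or #xA directly after the tag name is valid XML, but under this sketch's assumption its records would not be detected.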