java uima cleartk

String IN, String OUT?


I'm new to ClearTK and UIMA. So far I haven't found any examples of how to create a pipeline where no files are involved.

I'm trying to process a small text stored in a Java String variable using ClearTK and UIMA, and get an XML String back (the output of the ClearTK TimeML annotators).

I was able to provide a String as input (see the code excerpt), but the code is far from elegant (I had to set an empty URI on the CAS). Also, the output is saved to a file, but I want to get a String back (it makes no sense to save the output to a file and then read that file back into memory).

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;
import org.cleartk.corpus.timeml.TempEval2007Writer;
import org.cleartk.opennlp.tools.PosTaggerAnnotator;
import org.cleartk.snowball.DefaultSnowballStemmer;
import org.cleartk.timeml.event.*;
import org.cleartk.timeml.time.TimeTypeAnnotator;
import org.cleartk.timeml.tlink.TemporalLinkEventToDocumentCreationTimeAnnotator;
import org.cleartk.timeml.tlink.TemporalLinkEventToSameSentenceTimeAnnotator;
import org.cleartk.timeml.tlink.TemporalLinkEventToSubordinatedEventAnnotator;
import org.cleartk.timeml.type.DocumentCreationTime;
import org.cleartk.token.tokenizer.TokenAnnotator;
import org.cleartk.util.cr.FilesCollectionReader;

...

String documentText = "First make sure that you are using eggs that are several days old...";
JCas sourceCas = createJCas();

sourceCas.setDocumentText(documentText);
ViewUriUtil.setURI(sourceCas, new URI(""));

SimplePipeline.runPipeline(
        sourceCas,
        org.cleartk.opennlp.tools.SentenceAnnotator.getDescription(),
        TokenAnnotator.getDescription(),
        PosTaggerAnnotator.getDescription(),
        DefaultSnowballStemmer.getDescription("English"),
        org.cleartk.opennlp.tools.ParserAnnotator.getDescription(),
        org.cleartk.timeml.time.TimeAnnotator.FACTORY.getAnnotatorDescription(),
        TimeTypeAnnotator.FACTORY.getAnnotatorDescription(),
        EventAnnotator.FACTORY.getAnnotatorDescription(),
        EventTenseAnnotator.FACTORY.getAnnotatorDescription(),
        EventAspectAnnotator.FACTORY.getAnnotatorDescription(),
        EventClassAnnotator.FACTORY.getAnnotatorDescription(),
        EventPolarityAnnotator.FACTORY.getAnnotatorDescription(),
        EventModalityAnnotator.FACTORY.getAnnotatorDescription(),
        AnalysisEngineFactory.createEngineDescription(AddEmptyDCT.class),
        TemporalLinkEventToDocumentCreationTimeAnnotator.FACTORY.getAnnotatorDescription(),
        TemporalLinkEventToSameSentenceTimeAnnotator.FACTORY.getAnnotatorDescription(),
        TemporalLinkEventToSubordinatedEventAnnotator.FACTORY.getAnnotatorDescription(),
        TempEval2007Writer.getDescription("file:///tmp/out.tml"));

What would be the recommended way to have the pipeline take a String as input and produce another String as the execution result?


Solution

  • Run your engines with SimplePipeline as you did, then retrieve the annotations you are interested in from your sourceCas. JCasUtil.select returns a collection, so iterate over it and read the property from each annotation:

    Collection<MyAnnotation> myAnnotations = JCasUtil.select(sourceCas, MyAnnotation.class);
    for (MyAnnotation annotation : myAnnotations) {
        String myProperty = annotation.getMyproperty();
    }
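
If the goal is an XML String rather than a file, one option is to build it in memory from the offsets of the selected annotations. The following is a minimal sketch under stated assumptions, not ClearTK's own writer: Span is a hypothetical stand-in for a UIMA annotation (real annotations expose getBegin() and getEnd()), InlineXmlSketch and toInlineXml are illustrative names, overlapping spans are simply skipped, and the emitted tags carry none of the attributes that real TimeML output has.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a UIMA annotation: a tag name plus character offsets. */
final class Span {
    final int begin;   // inclusive start offset into the document text
    final int end;     // exclusive end offset into the document text
    final String tag;  // element name to emit, e.g. "EVENT" or "TIMEX3"

    Span(int begin, int end, String tag) {
        this.begin = begin;
        this.end = end;
        this.tag = tag;
    }
}

final class InlineXmlSketch {
    /** Escapes the characters that are special in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Wraps each span of the text in <tag>...</tag> and returns the result
     * as a String, without touching the file system.
     */
    static String toInlineXml(String text, List<Span> spans) {
        // Work on a sorted copy so the caller's list is left unchanged.
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((Span s) -> s.begin));

        StringBuilder out = new StringBuilder();
        int cursor = 0; // offset of the first character not yet written
        for (Span span : sorted) {
            if (span.begin < cursor) {
                continue; // simplification: skip spans that overlap an earlier one
            }
            out.append(escape(text.substring(cursor, span.begin)));
            out.append('<').append(span.tag).append('>');
            out.append(escape(text.substring(span.begin, span.end)));
            out.append("</").append(span.tag).append('>');
            cursor = span.end;
        }
        out.append(escape(text.substring(cursor))); // trailing untagged text
        return out.toString();
    }
}
```

For example, calling toInlineXml on "Wait two days before boiling." with a TIMEX3 span over "two days" and an EVENT span over "boiling" returns "Wait <TIMEX3>two days</TIMEX3> before <EVENT>boiling</EVENT>.". In a real pipeline the spans would be filled from the annotations returned by JCasUtil.select after SimplePipeline.runPipeline has finished.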