I am setting up a new framework based on UIMA and DKPro-Core. The framework provides an easy way to configure UIMA pipelines.
Inside some JCasAnnotator, I want to compare the data of all documents (in this case, I assume, different JCas objects) pairwise or against a single JCas.
runPipeline(createReaderDescription(SomeReader.class),
        somePreprocessingEngineDescription,
        similarityPipelineDescription,
        createEngineDescription(SomeWriter.class)
);
Inside similarityPipelineDescription, I would like to compare the data for one JCas against all JCas objects.
public void process(JCas aJCas) throws AnalysisEngineProcessException {
    // Compare aJCas with all other JCas objects
}
Is this the recommended way to do this?
If so, how can I gain access to the other JCas objects?
Or should I save the data that I want to compare and compare it later?
The (J)Cas objects that are passed through a pipeline are typically reused. It therefore does not make sense to collect references to them in an analysis engine. Keeping references to feature structures obtained from them won't work either, because those are invalidated when the (J)Cas is reset and reused.
You can write the data to disk and later read it into multiple CAS objects which you then can compare.
Alternatively, you could implement an analysis engine which extracts the data you are interested in into an independent set of objects and base your comparison on that.
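As a rough illustration of that second option, here is a minimal sketch in plain Java with no UIMA dependency. The class name DocumentDataCollector, its methods, and the choice of Jaccard similarity over token sets are all assumptions made for the example, not part of UIMA or DKPro Core. The idea is that process(JCas) would copy plain values (document id, token strings) out of the CAS and pass them to add(...), and the comparison would run once all documents have been seen, for example from collectionProcessComplete().

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical helper: stores plain Java copies of per-document data.
 * It never holds a JCas or a feature structure, so CAS reuse cannot invalidate it.
 */
class DocumentDataCollector {
    // document id -> token strings copied out of the CAS
    private final Map<String, Set<String>> tokensByDocument = new LinkedHashMap<>();

    /** Intended to be called from process(JCas) with values already copied to Strings. */
    void add(String documentId, List<String> tokens) {
        tokensByDocument.put(documentId, new HashSet<>(tokens));
    }

    /** Jaccard similarity of two token sets (an example measure, chosen for brevity). */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Compares every unordered pair of collected documents; keys look like "idA|idB". */
    Map<String, Double> compareAllPairs() {
        Map<String, Double> result = new LinkedHashMap<>();
        List<String> ids = new ArrayList<>(tokensByDocument.keySet());
        for (int i = 0; i < ids.size(); i++) {
            for (int j = i + 1; j < ids.size(); j++) {
                double score = jaccard(tokensByDocument.get(ids.get(i)),
                                       tokensByDocument.get(ids.get(j)));
                result.put(ids.get(i) + "|" + ids.get(j), score);
            }
        }
        return result;
    }
}
```

Note that this keeps the extracted data of every document in memory, so it only suits collections where that data is small enough to hold at once.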
If you want to do pairwise comparisons, you could also implement a reader which reads the data you wish to compare into two different views of the same CAS and then have your analysis engines process and compare those views. See for example the DKPro TC PairReader_ImplBase and its subclasses.