We have multiple Ruta scripts that are set up to run sequentially on incoming emails. Is it a good idea to create the seed and RutaBasic annotations once, reuse them while executing the Ruta scripts one by one, and empty the CAS only after all scripts have been executed?
CAS cas = jCas.getCas();
// initialize the seed and RutaBasic annotations once
for (String rutaScript : rutaScripts) {
    // execute the Ruta scripts one by one on the same CAS
}
// clear the CAS after all scripts have run, e.g. cas.reset()
The TokenSeed annotations are commonly created only once, as they should represent a simple static layer. The RutaTokenSeedAnnotator, for example, only creates new annotations if there are no TokenSeed annotations yet. They can be shared like any other annotation.
The RutaBasic annotations store additional information about the annotations. They need to be updated for each addition or removal of any annotation, i.e., the internal maps need to be up to date all the time or else the rules will be executed incorrectly. The RutaBasic annotations can be shared across different analysis engines processing the same CAS, and the RutaEngine provides parameters for configuring the internal update strategy. These parameters are named PARAM_INDEX_* or PARAM_REINDEX_*.
If your pipeline consists only of consecutive RutaEngines, with no other analysis engine in between, then you can set PARAM_REINDEX_UPDATE_MODE to NONE, as no other analysis engine modifies the indexes.
If runtime is not an issue, then you can set PARAM_REINDEX_UPDATE_MODE to COMPLETE and the RutaEngine will update everything.
If you know that the analysis engines in between two RutaEngines do not remove any annotations, then you can set PARAM_REINDEX_UPDATE_MODE to ADDITIVE. The internal update is faster in this mode.
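The three cases above amount to a small decision rule, which can be sketched in plain Java as follows. This is only an illustration and not part of the Ruta API: the class ReindexModeChooser, the enum EngineBehavior and the method chooseMode are hypothetical names, and the sketch assumes that COMPLETE is the safe fallback whenever an engine in between may remove annotations or its behavior is unknown. Only the mode names NONE, ADDITIVE and COMPLETE are taken from the text.

```java
import java.util.List;

/** Toy helper (hypothetical, not part of UIMA Ruta): picks an update-mode name. */
final class ReindexModeChooser {

    /** What is known about an analysis engine running between two RutaEngines. */
    enum EngineBehavior { ADDS_ONLY, MAY_REMOVE, UNKNOWN }

    /**
     * Returns the name of the mode suggested by the rules above.
     * The returned string is meant as the value for PARAM_REINDEX_UPDATE_MODE.
     */
    static String chooseMode(List<EngineBehavior> enginesInBetween) {
        if (enginesInBetween.isEmpty()) {
            // Nothing runs in between, so nothing modified the indexes.
            return "NONE";
        }
        for (EngineBehavior behavior : enginesInBetween) {
            if (behavior != EngineBehavior.ADDS_ONLY) {
                // Assumption: fall back to the full update when removals are possible.
                return "COMPLETE";
            }
        }
        // Every engine in between only adds annotations.
        return "ADDITIVE";
    }

    public static void main(String[] args) {
        System.out.println(chooseMode(List.of()));                           // NONE
        System.out.println(chooseMode(List.of(EngineBehavior.ADDS_ONLY)));   // ADDITIVE
        System.out.println(chooseMode(List.of(EngineBehavior.ADDS_ONLY,
                                              EngineBehavior.MAY_REMOVE)));  // COMPLETE
    }
}
```

For the setup in the question, where the Ruta scripts run directly one after another on the same CAS, this rule yields NONE.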
The other PARAM_INDEX_* and PARAM_REINDEX_* parameters can be used to optimize different aspects of the Ruta indexing and reindexing, improving speed as well as memory consumption.