uimaruta

UIMA Ruta WORDLIST missing entries


I am using UIMA Ruta 2.7.0:

DECLARE Substance;
WORDLIST EnzymeSearchList = 'enzyme.txt';
Document{-> MARKFAST(Substance, EnzymeSearchList, true)}; // true ignores case

enzyme.txt contains about 16,000 entries (one per line).

If I use a file containing only a few entries, for example 5, my further rules work without any problem. Once I provide the full list of thousands of entries, my results are incomplete.

Could the issue be caused by reaching a WORDLIST size limit, or perhaps by the heap size? Nothing fails during program execution.

I have found a thread that specifically states:

There is no maximum size for the wordlists in UIMA Ruta. ... My largest wordlist consisted of about 500k entries


Solution

  • I assume that by "incomplete" you mean that several (obvious) entities have not been found/annotated in the document?

    This is most likely caused by whitespace in the enzyme.txt file. Can you verify this, e.g., by removing all whitespace in this file and retesting the script?

    If the problem is caused by whitespace, there are several options to solve or avoid it. For example, you can set the config param 'dictRemoveWS' to true to automatically remove the whitespace when the dictionary is loaded.

    Is upgrading to UIMA Ruta 2.8.1 (which should also fix this problem) an option?
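
To check the whitespace hypothesis before changing any configuration, a small script outside of Ruta can list the affected entries. The following is only a rough sketch in Python, not part of UIMA Ruta: the function names (entries_with_whitespace, strip_all_whitespace, check_file) are made up for illustration, and it assumes the file is UTF-8 encoded with one entry per line.

```python
from typing import Iterable, List, Tuple


def entries_with_whitespace(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, entry) for every non-empty entry that contains whitespace."""
    flagged = []
    for number, raw in enumerate(lines, start=1):
        entry = raw.rstrip("\r\n")  # drop only the line terminator
        if entry and any(ch.isspace() for ch in entry):
            flagged.append((number, entry))
    return flagged


def strip_all_whitespace(lines: Iterable[str]) -> List[str]:
    """Return the entries with all whitespace removed, skipping lines that end up empty."""
    cleaned = ("".join(raw.split()) for raw in lines)
    return [entry for entry in cleaned if entry]


def check_file(path: str) -> List[Tuple[int, str]]:
    """Hypothetical helper: read a word list (assumed UTF-8) and report suspicious entries."""
    with open(path, encoding="utf-8") as handle:
        return entries_with_whitespace(handle.readlines())
```

Calling check_file("enzyme.txt") would return the line numbers and entries that contain spaces or tabs, including leading or trailing ones. If that list is long, it supports the whitespace explanation; strip_all_whitespace can then produce a cleaned copy of the list for the retest suggested above.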