I'm looking to capture measurements using Stanford CoreNLP. (If you can suggest a different extractor, that is fine too.)
For example, I want to find 15kg, 15 kg, 15.0 kg, 15 kilogram, 15 lbs, 15 pounds, etc. But among CoreNLP's extraction rules, I don't see one for measurements.
Of course, I can do this with pure regexes, but toolkits can run more quickly, and they offer the opportunity to chunk at a higher level, e.g. to treat gb and gigabytes together, and RAM and memory as building blocks--even without full syntactic parsing--as they build bigger units like 128 gb RAM and 8 gigabytes memory.
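For reference, the pure-regex baseline could look roughly like the sketch below. It is only an illustration in Python's standard `re` module, not anything from CoreNLP. The alias table `UNIT_ALIASES` and the function `extract_measurements` are hypothetical names made up for this example, and the list of units is deliberately tiny.

```python
import re

# Hypothetical alias table (an assumption for this sketch):
# surface form -> canonical unit. Extend as needed.
UNIT_ALIASES = {
    "kg": "kg", "kilogram": "kg", "kilograms": "kg",
    "lb": "lb", "lbs": "lb", "pound": "lb", "pounds": "lb",
    "gb": "GB", "gigabyte": "GB", "gigabytes": "GB",
}

# Try longer aliases first so "kilograms" is preferred over "kilogram".
_units = "|".join(sorted(map(re.escape, UNIT_ALIASES), key=len, reverse=True))

# A number (optionally with a decimal part), optional whitespace, then a unit.
# The lookbehind keeps the match from starting in the middle of a word or number.
MEASUREMENT = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*(" + _units + r")\b",
    re.IGNORECASE,
)

def extract_measurements(text):
    """Return (value, canonical_unit) pairs found in text."""
    return [
        (float(m.group(1)), UNIT_ALIASES[m.group(2).lower()])
        for m in MEASUREMENT.finditer(text)
    ]
```

Called on "128 gb RAM and 8 gigabytes memory", this maps both mentions to the same canonical unit, but it knows nothing about RAM or memory as building blocks, which is the part I would like a toolkit to handle.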
I want an extractor for this that is rule-based (not machine-learning-based), but I don't see one as part of RegexNER or elsewhere. How do I go about this?
IBM Named Entity Extraction can do this. It runs the regexes efficiently rather than passing the text through each one in turn. The regexes are also bundled to express meaningful entities; for example, one bundle unites all the measurement units into a single concept.
I don't think a rule-based system exists for this particular task. However, it shouldn't be hard to build one with TokensRegexNER. For example, a mapping like the following, with the pattern and its label in tab-separated columns:
[{ner:NUMBER}]+ /(k|m|g|t)b/ memory? MEMORY
[{ner:NUMBER}]+ /"|''|in(ches)?/ LENGTH
...
You could also try vanilla TokensRegex and extract just the relevant value with a named capture group:
(?$group_name [{ner:NUMBER}]+) /(k|m|g|t)b/ memory?