Whether you're using the Stanza or the (now deprecated) CoreNLP Python wrappers, or the original Java implementation, the tokenization rules that Stanford CoreNLP follows are super hard for me to figure out from the original source code.
The implementation is very verbose and the tokenization approach is not really documented. Do they consider this proprietary? On their website, they say that "CoreNLP splits texts into tokens with an elaborate collection of rules, designed to follow UD 2.0 specifications."
I'm looking for where to find those rules, and ideally I'd like to replace CoreNLP (a massive codebase!) with a regex or something much simpler that mimics its tokenization strategy (see the rough sketch at the end of this question for the level of simplicity I mean). Please assume in your responses that Stanford's tokenization behavior is the goal. I am not looking for alternative tokenization solutions, but I also very much do not want to include and ship a huge Java library as a dependency.
The answer should address the following behavior:
Here are a few notes from one of the main authors of CoreNLP. What you write in your answer is all basically correct, but there are many nuances. 😊
PTBTokenizer supports both (the older LDC/PTB-style tokenization and the newer UD-style tokenization) by specifying options. The biggest difference is that the new tokenization splits on most hyphens (except common prefixes and suffixes), which seems to not be what you want. ( ) { } become -LRB- -RRB- -LCB- -RCB- in LDC tokenization, something you appear not to want.

t.tokenize("Independent Living http://www.inlv.demon.nl/.")
['Independent', 'Living', 'http', ':', '//www.inlv.demon.nl/', '.']
# versus CoreNLP output is:
# { "Independent", "Living", "http://www.inlv.demon.nl/", "." }
t.tokenize("I'd've thought that they'd've liked it.")
["I'd", "'ve", 'thought', 'that', "they'd", "'ve", 'liked', 'it', '.']
# versus CoreNLP output is:
# { "I", "'d", "'ve", "thought", "that", "they", "'d", "'ve", "liked", "it", "." }