I used spaCy to implement a toolkit for text analysis, structured in two levels:
One extension in B annotates token spans with references to nodes in cloud-hosted lexical-semantic networks such as BabelNet (Wikidata + WordNet + ...). To minimize calls to the BabelNet web API, I would like to filter out lemmas and word forms that are very common in general language use.
Question:
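spaCy does not expose explicit frequency counts, but it does offer indirect, frequency-related signals for a word form or lemma: for example the stop-word flag (`Token.is_stop`) and the smoothed log-probability estimate (`Token.prob` / `Lexeme.prob`), where the underlying data is available for the language.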
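A minimal sketch of such a filter, assuming tokens expose the spaCy attributes `is_stop`, `is_alpha`, `lemma_` and `prob`. The helper names (`is_worth_lookup`, `lemmas_to_query`) and the cutoff `COMMON_LOGPROB` are hypothetical and only for illustration. The demo uses stand-in objects so that it runs without spaCy installed. As far as I know, in recent spaCy versions `prob` is only meaningful when a probability table is available (for instance via the `spacy-lookups-data` package); otherwise every word gets the same default value and only the stop-word check does any real filtering.

```python
from types import SimpleNamespace

# Assumed cutoff: a log probability above this is treated as "very common".
# The value is illustrative only and should be tuned on your own data.
COMMON_LOGPROB = -10.0


def is_worth_lookup(token, common_logprob=COMMON_LOGPROB):
    """Return True if the token looks rare enough to justify a web API call."""
    # Stop words and non-alphabetic tokens are never looked up.
    if token.is_stop or not token.is_alpha:
        return False
    # prob is a smoothed log probability: more negative means rarer.
    return token.prob < common_logprob


def lemmas_to_query(tokens, common_logprob=COMMON_LOGPROB):
    """Collect the distinct lemmas that pass the filter, in first-seen order."""
    seen = {}
    for tok in tokens:
        if is_worth_lookup(tok, common_logprob):
            seen.setdefault(tok.lemma_, None)
    return list(seen)


# Stand-in tokens with made-up probabilities (not real spaCy output).
demo = [
    SimpleNamespace(lemma_="the", is_stop=True, is_alpha=True, prob=-3.5),
    SimpleNamespace(lemma_="cat", is_stop=False, is_alpha=True, prob=-8.0),
    SimpleNamespace(lemma_="synecdoche", is_stop=False, is_alpha=True, prob=-17.0),
    SimpleNamespace(lemma_="synecdoche", is_stop=False, is_alpha=True, prob=-18.0),
    SimpleNamespace(lemma_="42", is_stop=False, is_alpha=False, prob=-9.0),
    SimpleNamespace(lemma_="babelnet", is_stop=False, is_alpha=True, prob=-20.0),
]

result = lemmas_to_query(demo)  # ["synecdoche", "babelnet"]
```

With a real pipeline the same function should accept a `Doc` directly, e.g. `lemmas_to_query(nlp(text))`, since iterating a `Doc` yields tokens. Deduplicating on lemmas before calling the web API also reduces the number of requests, independently of the frequency cutoff.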