python-2.7 · nlp · named-entity-recognition · spacy

Re-Training spaCy's NER v1.8.2 - Training Volume and Mix of Entity Types


I'm in the process of (re-)training spaCy's Named Entity Recognizer and have a few questions that I hope a more experienced researcher/practitioner can help me figure out:

  1. If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100,000 examples per entity/label excessive?
  2. If I introduce a new label, is it best if the number of entities with that label is roughly the same as for the other labels (balanced) during training?
  3. Regarding the mixing in 'examples of other entity types':
    • do I just add random known categories/labels to my training set, e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], )?

    • can I use the same text for various labels? e.g. ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(55,64, 'COMMODITY')], )?

I'm working with Python 2.7 on Ubuntu 16.04, using spaCy 1.8.2.


Solution

  • For a full answer by Matthew Honnibal, check out issue 1054 on spaCy's GitHub page. Below are the most important points as they relate to my questions:

    Question(Q) 1: If a few hundred examples are considered 'a good starting point', then what would be a reasonable number to aim for? Is 100,000 examples per entity/label excessive?

    Answer(A): Every machine learning problem will have a different examples/accuracy curve. You can get an idea for this by training with less data than you have, and seeing what the curve looks like. If you have 1,000 examples, then try training with 500, 750, etc, and see how that affects your accuracy.
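    (The following is not from the GitHub issue.) As an illustration, here is a minimal sketch of how such a learning-curve check might look with the spaCy 1.x training API, modelled on spaCy's v1 training examples. It assumes an installed 'en' model; TRAIN_DATA is a stand-in that would in practice hold your hundreds or thousands of annotated (text, entity_offsets) tuples, and the evaluation step is left as a comment for your own accuracy function.

    ```python
    # Probe the examples/accuracy curve by retraining on growing subsets of the data.
    import random
    import spacy
    from spacy.gold import GoldParse

    # Stand-in training data; in practice this list would be much larger.
    TRAIN_DATA = [
        ('The Business Standard published in its recent issue on crude oil and natural gas ...',
         [(4, 21, 'ORG'), (55, 64, 'COMMODITY'), (69, 80, 'COMMODITY')]),
        # ... many more fully annotated examples ...
    ]

    def train_ner(nlp, train_data, n_iter=5):
        # spaCy 1.x update loop: run the tagger first (the NER uses POS
        # features), then update the entity recognizer on the gold offsets.
        for _ in range(n_iter):
            random.shuffle(train_data)
            for raw_text, entity_offsets in train_data:
                doc = nlp.make_doc(raw_text)
                gold = GoldParse(doc, entities=entity_offsets)
                nlp.tagger(doc)
                nlp.entity.update(doc, gold)
        nlp.end_training()

    for n in (500, 750, 1000):
        nlp = spacy.load('en')             # start from a fresh model for each subset size
        nlp.entity.add_label('COMMODITY')  # register the new label before updating
        train_ner(nlp, TRAIN_DATA[:n])
        # ... measure accuracy on a held-out set here and compare across subset sizes ...
    ```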

    Q 2: If I introduce a new label, is it best if the number of entities with that label is roughly the same (balanced) during training?

    A: There's a trade-off between making the gradients too sparse and making the learning problem too unrepresentative of what the actual examples will look like.

    Q 3: Regarding the mixing in 'examples of other entity types':

    • do I just add random known categories/labels to my training set?

    A: No, one should annotate all the entities in that text. So the example above, ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG')], ), should instead be ('The Business Standard published in its recent issue on crude oil and natural gas ...', [(4,21, 'ORG'), (55,64, 'COMMODITY'), (69,80, 'COMMODITY')], ). See the short data sketch after this list.

    • can I use the same text for various labels?:

    A: Not in the way the examples were given. See previous answer.

    • what ratio between new and other (old) labels is considered reasonable?:

    A: See the answer to Q 2.
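    (Not from the GitHub issue.) To restate the two annotation answers above as data: don't reuse the same text as several partially annotated examples, one per label; annotate every entity in the text once, in a single (text, offsets) tuple. The offsets below follow the example sentence from the question, with COMMODITY as the hypothetical new label.

    ```python
    # Wrong: the same text appears twice, each copy annotated for only one label.
    wrong = [
        ('The Business Standard published in its recent issue on crude oil and natural gas ...',
         [(4, 21, 'ORG')]),
        ('The Business Standard published in its recent issue on crude oil and natural gas ...',
         [(55, 64, 'COMMODITY')]),
    ]

    # Right: one example with every entity in the text annotated.
    right = [
        ('The Business Standard published in its recent issue on crude oil and natural gas ...',
         [(4, 21, 'ORG'), (55, 64, 'COMMODITY'), (69, 80, 'COMMODITY')]),
    ]
    ```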


    PS: The quoted answers are taken directly from the GitHub issue.