I am working on an NLP project for sentiment analysis, and I am using SpaCy to tokenize sentences. While reading the documentation, I learned about NER and that it can be used to extract entities from text, for example to aid a user's search. What I am trying to understand is how to incorporate it into my tokenization process (if I should at all). Here is an example:
text = "Let's not forget that Apple Pay in 2014 required a brand new iPhone in order to use it. A significant portion of Apple's user base wasn't able to use it even if they wanted to. As each successive iPhone incorporated the technology and older iPhones were replaced the number of people who could use the technology increased."
sentence = sp(text) # sp = spacy.load('en_core_web_sm')
for word in sentence:
print(word.text)
# Let
# 's
# not
# forget
# that
# Apple
# Pay
# in
# etc...
for word in sentence.ents:
    print(word.text + " _ " + word.label_ + " _ " + str(spacy.explain(word.label_)))
# Apple Pay _ ORG _ Companies, agencies, institutions, etc.
# 2014 _ DATE _ Absolute or relative dates or periods
# iPhone _ ORG _ Companies, agencies, institutions, etc.
# Apple _ ORG _ Companies, agencies, institutions, etc.
# iPhones _ ORG _ Companies, agencies, institutions, etc.
The first loop shows that 'Apple' and 'Pay' are separate tokens, while the second loop shows that the entity recognizer understands 'Apple Pay' as a single ORG. My thinking is: shouldn't 'Apple' and 'Pay' be tokenized together as one token, so that when I create my classifier it recognizes the entity rather than a fruit ('Apple') and a verb ('Pay')? If so, how could I achieve that (let's say) "type" of tokenization?
Tokenization is typically the splitting of a sentence into words or even subwords. I am not sure what you plan to do with the data later, but the convention in NLP is to stick to either the document level, the sentence level, or the word/token level. Having a mix of token-level and n-gram-level items (like ["Apple Pay", "required", "an", "iPhone", "to", "use", "it", "."]) will, in my opinion, not help you in most later use cases.
If you later train a classifier (assuming you're talking about fine-tuning a transformer-based language model on a token classification task), you would then use something like the IOB format to handle n-grams, e.g. like so (a sketch of how to get such labels out of spaCy follows the table):
Token | Label
--- | ---
Apple | B
Pay | I
required | O
an | O
iPhone | B
to | O
use | O
it | O
. | O
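As a minimal sketch of how such labels could be obtained (assuming the same en_core_web_sm model as in your example, and a shortened sentence), spaCy already exposes per-token IOB information through the ent_iob_ and ent_type_ attributes:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple Pay required a new iPhone to use it.")

# Each token carries its own IOB tag ("B", "I" or "O") and entity type (e.g. "ORG")
for token in doc:
    print(token.text, token.ent_iob_, token.ent_type_, sep="\t")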
Of course, this depends on your application, and directly merging tokens into n-grams might work well for you. If your application involves searching for frequent n-grams, you could use collocation metrics to extract those n-grams, e.g. using NLTK's CollocationFinder.
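A rough sketch of that approach with NLTK's bigram collocation finder; the corpus and the frequency threshold here are placeholders:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Placeholder corpus: in practice these would be the tokens of your documents
tokens = (
    "Apple Pay required a new iPhone . "
    "Apple Pay adoption grew as older iPhones were replaced ."
).split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep only bigrams that occur at least twice

# Rank the remaining bigrams by pointwise mutual information (PMI)
print(finder.nbest(bigram_measures.pmi, 5))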
Or, as you mentioned, use SpaCy for either noun chunk extraction or named entity recognition. For the latter, you could access the token-level ent_type_ and ent_iob_ attributes, iterate over the tokens in the processed docs once, and then merge these n-grams together based on their IOB tags.
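A minimal sketch of that merging step; instead of grouping the IOB tags by hand, this version uses spaCy's retokenizer to merge each predicted entity span into a single token (the text is taken from your example):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Let's not forget that Apple Pay in 2014 required a brand new iPhone in order to use it.")

# Merge every predicted entity span (e.g. 'Apple Pay') into one token
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)

print([token.text for token in doc])
# 'Apple Pay' and '2014' now show up as single tokens

In recent spaCy versions there is also a built-in merge_entities pipeline component (added via nlp.add_pipe("merge_entities")) that performs the same merging as part of the pipeline.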