python-3.x nlp tokenize spacy n-gram

Is there a bigram or trigram feature in spaCy?


The code below splits the sentence into individual tokens and produces the following output:

 "Cloud"  "computing"  "is"  "benefiting"  "major"  "manufacturing"  "companies"


import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)

Ideally, I would like 'cloud computing' to be read as a single unit, since it is effectively one term.

In other words, I am looking for bigrams. Does spaCy have a feature that produces bigrams or trigrams?


Solution

  • spaCy can detect noun chunks (base noun phrases). To treat each noun phrase as a single token, do this:

    1. Detect the noun chunks: https://spacy.io/usage/linguistic-features#noun-chunks

    2. Merge each noun chunk into one token.

    3. Iterate over the document again. "Cloud computing" now appears as a single token, and the document does not need to be parsed a second time.

    >>> import spacy
    >>> nlp = spacy.load('en_core_web_sm')
    >>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
    >>> list(doc.noun_chunks)
    [Cloud computing, major manufacturing companies]
    >>> # Merge every noun chunk into one token; the merged token takes its attributes from the chunk's root
    >>> with doc.retokenize() as retokenizer:
    ...     for noun_phrase in list(doc.noun_chunks):
    ...         retokenizer.merge(noun_phrase)
    ... 
    >>> # Each noun chunk is now a single token (exact POS labels vary by model version)
    >>> [(token.text, token.pos_) for token in doc]
    [('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]
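
Note that noun chunks are syntactic phrases of variable length, not n-grams in the strict sense. If fixed-length bigrams or trigrams over the token sequence are what you need, a sliding window over the token texts is enough. The following is a minimal sketch, assuming the token texts have already been extracted (for example with `[token.text for token in doc]`). The helper name `ngrams` is hypothetical and is not part of spaCy.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Zip n staggered views of the list; zip stops at the shortest view,
    # so a sequence shorter than n simply yields no windows.
    return list(zip(*(tokens[i:] for i in range(n))))


# Token texts for the sentence from the question (hard-coded here so the
# sketch runs without spaCy installed).
tokens = ["Cloud", "computing", "is", "benefiting",
          "major", "manufacturing", "companies"]

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print([" ".join(gram) for gram in bigrams])
print([" ".join(gram) for gram in trigrams])
```

This produces every adjacent pair or triple, including ones such as "computing is" that are not meaningful phrases, so some filtering (by frequency or part of speech, for instance) is usually needed afterwards.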