nlp, spacy, word-embedding

How to get token ids using spaCy (I want to map a text sentence to a sequence of integers)


I want to use spaCy to tokenize sentences into a sequence of integer token ids that I can use for downstream tasks. I expect to use it roughly as shown below. What should replace the ??? placeholder?

import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_lg')

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at ")

doc = nlp(text)

idxs = ???

print(idxs)

I want the output to be something like:

array([ 8045, 70727, 24304, 96127, 44091, 37596, 24524, 35224, 36253])

Preferably, the integers would refer to some embedding id in en_core_web_lg.

spacy.io/usage/vectors-similarity gives no hint as to which attribute of doc to look for.

I asked this on Cross Validated, but it was closed as off-topic. The proper terms for googling or describing this problem would also be helpful.


Solution

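  • Look up each token's norm hash in nlp.vocab.vectors.key2row, which maps hashes to row indices in the vector table: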

    import spacy
    nlp = spacy.load('en_core_web_md')
    text = (u"When Sebastian Thrun started working on self-driving cars at ")
    
    doc = nlp(text)
    
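    ids = []
    for token in doc:
        if token.has_vector:
            # token.norm is the hash of the token's normalized form;
            # .get() returns None instead of raising KeyError if that hash has no row
            row = nlp.vocab.vectors.key2row.get(token.norm)
        else:
            # Tokens without a vector get no id
            row = None
        ids.append(row)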
    
    print([token for token in doc])
    print(ids)
    #>> [When, Sebastian, Thrun, started, working, on, self, -, driving, cars, at]
    #>> [71, 19994, None, 369, 422, 19, 587, 32, 1169, 1153, 41]
    

    Breaking this down:

    # A Vocab whose __getitem__ takes a string and returns a Lexeme; the Lexeme's .norm is a hash
    nlp.vocab 
    # >>  <spacy.vocab.Vocab at 0x12bcdce48>
    nlp.vocab['hello'].norm # hash
    # >> 5983625672228268878
    
    
    # The array holding the word vectors, one row per vector
    nlp.vocab.vectors.data.shape
    # >> (20000, 300)
    
    # A dict mapping hash -> row in this array
    nlp.vocab.vectors.key2row
    # >> {12646065887601541794: 0,
    # >>  2593208677638477497: 1,
    # >>  ...}
    
    # So, to get the integer id (row index) of 'earth' and then its vector:
    i = nlp.vocab.vectors.key2row[nlp.vocab['earth'].norm]
    nlp.vocab.vectors.data[i]
    
    # Note that every token has a hash but may not have a vector
    # (and hence no entry in .key2row)
    nlp.vocab['Thrun'].has_vector
    # >> False
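
The question asks for a plain integer sequence, whereas the loop above yields None for tokens without a vector. A minimal sketch of one way to handle that, assuming a sentinel id is acceptable for such tokens: the helper name keys_to_rows, its unk_row parameter, and the toy data are all hypothetical and not part of spaCy. It works on any hash -> row mapping, so it is shown here with a small stand-in dict rather than a loaded model.

```python
from typing import Hashable, Iterable, List, Mapping


def keys_to_rows(
    keys: Iterable[Hashable],
    key2row: Mapping[Hashable, int],
    unk_row: int = -1,
) -> List[int]:
    """Map each key to its row index, using unk_row for keys that have no row."""
    return [int(key2row.get(key, unk_row)) for key in keys]


# Toy stand-in for nlp.vocab.vectors.key2row (hash -> row), holding only the
# two entries shown above; the real table has many more.
toy_key2row = {12646065887601541794: 0, 2593208677638477497: 1}

# Hypothetical sequence of token.norm hashes. The last one is deliberately
# absent from the toy table, so it falls back to unk_row.
toy_keys = [2593208677638477497, 12646065887601541794, 5983625672228268878]

print(keys_to_rows(toy_keys, toy_key2row))  # [1, 0, -1]
```

With a loaded pipeline, the equivalent call would be keys_to_rows((t.norm for t in doc), nlp.vocab.vectors.key2row), and passing the result to numpy.array gives the array form shown in the question. The -1 sentinel is only a placeholder: used directly as a NumPy index it would select the last row, so remap or mask it before looking up vectors.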