python spacy pos-tagger

Search for particular parts of speech (e.g. nouns) and print them along with a preceding word


I have a text made up of basic sentences, such as "she is a doctor", "he is a good person", and so forth. I'm trying to write a program that returns only each noun and the preceding pronoun (e.g. she, he, it). I need them printed as a pair, for example (she, doctor) or (he, person). I'm using spaCy, as this will let me work with similar texts in French and German as well.

This is the closest thing I've found elsewhere on this site to what I need. What I've been trying so far is to produce a list of the nouns in the text, then search the text for those nouns and print each noun together with the word 3 places before it (since this is the pattern for most of the sentences, and most is good enough for my purposes). This is what I've got for creating the list:

def spacy_tag(text):
  text_open = codecs.open(text, encoding='latin1').read()
  parsed_text = nlp_en(text_open)
  tokens = list([(token, token.tag_) for token in parsed_text])
  list1 = []
  for token, token.tag_ in tokens:
    if token.tag_ == 'NN':
      list1.append(token)
  return(list1)

However, when I try to do anything with it, I get an error message. I've tried using enumerate but I couldn't get that to work either. This is the current code I have for searching the text for the words in the list (I haven't gotten around to adding the part which should print the word several places beforehand as I'm still stuck on the searching part):

def spacy_search(text, list):
  text_open = codecs.open(text, encoding='latin1').read()
  for word in text_open:
   if word in list:
     print(word)

The error I get is at line 4, "if word in list:", and it says "TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)"

Is there a more efficient way of printing a PRP, NN pair using spaCy? Alternatively, how can I amend my code so that it searches the text for the nouns in the list? (It doesn't need to be a particularly elegant solution; it just needs to produce a result.)


Solution

  • Here is a clean way to implement your intended approach.

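    import spacy

    nlp = spacy.load("en_core_web_sm")

    # put your nouns of interest here
    NOUN_LIST = ["doctor", ...]

    def find_stuff(text):
        """Return the first (pronoun, noun) pair found, or None."""
        doc = nlp(text)
        if len(doc) < 4:
            return None  # too short to have a word three places back

        for tok in doc[3:]:
            # a noun of interest with a pronoun three tokens before it
            if tok.pos_ == "NOUN" and tok.text in NOUN_LIST and doc[tok.i-3].pos_ == "PRON":
                return (doc[tok.i-3].text, tok.text)
        return None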
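
On the TypeError in the question: the list built by spacy_tag holds spaCy Token objects, while looping over the raw file contents yields single characters (plain str), and a Token cannot be compared with a str. A minimal sketch of one way around this is to loop over the Doc and compare tok.text with plain strings. It assumes spaCy v3 and uses spacy.blank("en"), which only tokenizes, so the NOUNS set is a hypothetical stand-in for the output of the tagging step; find_pairs and offset are names made up for illustration.

```python
import spacy

# Tokenizer-only pipeline (assumption: no tagger is needed for a string lookup).
nlp = spacy.blank("en")

# Hypothetical stand-in for the nouns collected by the tagging step.
NOUNS = {"doctor", "person"}

def find_pairs(text, nouns=NOUNS, offset=3):
    """Return (word `offset` tokens earlier, noun) for every listed noun."""
    doc = nlp(text)
    pairs = []
    for tok in doc:
        # Compare strings with strings: tok.text, not the Token itself.
        if tok.text in nouns and tok.i >= offset:
            pairs.append((doc[tok.i - offset].text, tok.text))
    return pairs

pairs = find_pairs("she is a doctor")
print(pairs)
```

With a fixed offset of 3, "he is a good person" would pair "is" with "person", which is why find_stuff above also checks that the earlier token is a pronoun.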
    

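    As the other answer mentioned, though, counting a fixed number of tokens back is the wrong approach: it breaks as soon as a sentence deviates from the pattern. What you actually want is the subject and the object (or predicate nominative) of the sentence, and spaCy's DependencyMatcher is designed for that. Here's an example: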

    from spacy.matcher import DependencyMatcher
    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("she is a good person")
    
    pattern = [
      # anchor token: verb, usually "is"
      {
        "RIGHT_ID": "verb",
        "RIGHT_ATTRS": {"POS": "AUX"}
      },
      # verb -> pronoun
      {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "pronoun",
        "RIGHT_ATTRS": {"DEP": "nsubj", "POS": "PRON"}
      },
      # predicate nominatives have "attr" relation
      {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "target",
        "RIGHT_ATTRS": {"DEP": "attr", "POS": "NOUN"}
      }
    ]
    
    matcher = DependencyMatcher(nlp.vocab)
    matcher.add("PREDNOM", [pattern])
    matches = matcher(doc)
    
    for match_id, (verb, pron, target) in matches:
        print(doc[pron], doc[verb], doc[target])
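
To get the (pronoun, noun) pairs the question asks for across several sentences, the same pattern can be wrapped in a small helper. This is only a sketch, assuming spaCy v3: pronoun_noun_pairs is a made-up name, and the Doc objects are annotated by hand with the tags and relations a parser would be expected to produce, so that the example runs without downloading a model. In practice you would pass nlp(sentence) instead.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# Only the vocab is needed here; no trained model is loaded.
vocab = spacy.blank("en").vocab

# Same three-node pattern as above: copula -> subject pronoun, copula -> noun.
PATTERN = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "AUX"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "pronoun",
     "RIGHT_ATTRS": {"DEP": "nsubj", "POS": "PRON"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "target",
     "RIGHT_ATTRS": {"DEP": "attr", "POS": "NOUN"}},
]

matcher = DependencyMatcher(vocab)
matcher.add("PREDNOM", [PATTERN])

def pronoun_noun_pairs(doc):
    """Return (pronoun, noun) text pairs for every match in a parsed Doc."""
    pairs = []
    # Token ids come back in the same order as the pattern's nodes.
    for _match_id, (_verb, pron, target) in matcher(doc):
        pairs.append((doc[pron].text, doc[target].text))
    return pairs

# Hand-annotated stand-ins for parser output (assumed, not produced by a model).
docs = [
    Doc(vocab, words=["she", "is", "a", "doctor"],
        pos=["PRON", "AUX", "DET", "NOUN"],
        heads=[1, 1, 3, 1], deps=["nsubj", "ROOT", "det", "attr"]),
    Doc(vocab, words=["he", "is", "a", "good", "person"],
        pos=["PRON", "AUX", "DET", "ADJ", "NOUN"],
        heads=[1, 1, 4, 4, 1], deps=["nsubj", "ROOT", "det", "amod", "attr"]),
]

all_pairs = [pair for doc in docs for pair in pronoun_noun_pairs(doc)]
print(all_pairs)
```

Because the match follows the dependency arcs rather than token positions, the extra adjective in the second sentence does not throw it off.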
    

    You can inspect the dependency relations of a parsed sentence with displacy, spaCy's built-in visualizer. To learn more about what the relations mean, see Jurafsky and Martin's book, Speech and Language Processing.
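
As a rough illustration of checking the relations, the sketch below prints each token with its relation and head, then asks displacy for the SVG markup of the tree. It assumes spaCy v3, and the Doc is again annotated by hand (an assumption standing in for real parser output) so that no model download is required; with a loaded pipeline you would use doc = nlp("she is a good person") instead.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-annotated stand-in for a parsed sentence (assumed labels, not model output).
doc = Doc(vocab, words=["she", "is", "a", "good", "person"],
          pos=["PRON", "AUX", "DET", "ADJ", "NOUN"],
          heads=[1, 1, 4, 4, 1],
          deps=["nsubj", "ROOT", "det", "amod", "attr"])

# One (token, relation, head) triple per token.
relations = [(tok.text, tok.dep_, tok.head.text) for tok in doc]
for text, dep, head in relations:
    print(f"{text:<8}{dep:<8}{head}")

# jupyter=False makes render() return the markup instead of displaying it.
svg = displacy.render(doc, style="dep", jupyter=False)
```

The svg string can be written to a file and opened in a browser; displacy.serve(doc, style="dep") is an alternative that starts a local web server.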