python gcloud google-cloud-language

Google Cloud API output format in Python


If I call the function analyze_syntax from the Google Cloud Python library, it returns the following:

document = types.Document(content='Tried this', type=enums.Document.Type.PLAIN_TEXT)
info = client.analyze_syntax(document=document)
print(info)
sentences {
  text {
    content: "Tried this"
    begin_offset: -1
  }
}
tokens {
  text {
    content: "Tried"
    begin_offset: -1
  }
  part_of_speech {
    tag: VERB
    mood: INDICATIVE
    tense: PAST
  }
  dependency_edge {
    label: ROOT
  }
  lemma: "try"
}
tokens {
  text {
    content: "this"
    begin_offset: -1
  }
  part_of_speech {
    tag: DET
    number: SINGULAR
  }
  dependency_edge {
    label: DOBJ
  }
  lemma: "this"
}
language: "en"
print(info.tokens)
[text {
  content: "Tried"
  begin_offset: -1
}
part_of_speech {
  tag: VERB
  mood: INDICATIVE
  tense: PAST
}
dependency_edge {
  label: ROOT
}
lemma: "try"
, text {
  content: "this"
  begin_offset: -1
}
part_of_speech {
  tag: DET
  number: SINGULAR
}
dependency_edge {
  label: DOBJ
}
lemma: "this"
]
print(info.tokens[0].part_of_speech)
tag: VERB
mood: INDICATIVE
tense: PAST

This output format looks odd to me.

QUESTION: What type of object is that and how does it work?

I want to be able to convert it to a dictionary (in a better way than converting it to a string first) or iterate through it somehow (find which keys it has and their corresponding values).


Solution

  • The first thing you can do to find the type of an object in Python is to call the built-in function type():

    part_of_speech_0 = info.tokens[0].part_of_speech
    print(type(part_of_speech_0))
    

    This prints:

    <class 'google.cloud.language_v1.types.PartOfSpeech'>
    

    That is, a class defined by the Google Cloud Natural Language library itself.

    You can also see which attributes this class has with the built-in function dir():

    print(dir(part_of_speech_0))
    

    That results in:

    ['ACCUSATIVE', 'ACTIVE', 'ADJ', 'ADNOMIAL', 'ADP', 'ADV', 'ADVERBIAL', 'AFFIX', 'ASPECT_UNKNOWN', 'AUXILIARY', 'Aspect', 'ByteSize', 'CASE_UNKNOWN', 'CAUSATIVE', 'COMPLEMENTIVE', 'COMPLEMENTIZER', 'CONDITIONAL_MOOD', 'CONDITIONAL_TENSE', 'CONJ', 'Case', 'Clear', 'ClearExtension', 'ClearField', 'CopyFrom', 'DATIVE', 'DESCRIPTOR', 'DET', 'DUAL', 'DiscardUnknownFields', 'Extensions', 'FEMININE', 'FINAL_ENDING', 'FIRST', 'FORM_UNKNOWN', 'FUTURE', 'FindInitializationErrors', 'Form', 'FromString', 'GENDER_UNKNOWN', 'GENITIVE', 'GERUND', 'Gender', 'HasExtension', 'HasField', 'IMPERATIVE', 'IMPERFECT', 'IMPERFECTIVE', 'INDICATIVE', 'INSTRUMENTAL', 'INTERROGATIVE', 'IRREALIS', 'IsInitialized', 'JUSSIVE', 'LOCATIVE', 'LONG', 'ListFields', 'MASCULINE', 'MOOD_UNKNOWN', 'MergeFrom', 'MergeFromString', 'Mood', 'NEUTER', 'NOMINATIVE', 'NON_RECIPROCAL', 'NOT_PROPER', 'NOUN', 'NUM', 'NUMBER_UNKNOWN', 'Number', 'OBLIQUE', 'ORDER', 'PARTITIVE', 'PASSIVE', 'PAST', 'PERFECTIVE', 'PERSON_UNKNOWN', 'PLUPERFECT', 'PLURAL', 'PREPOSITIONAL', 'PRESENT', 'PROGRESSIVE', 'PRON', 'PROPER', 'PROPER_UNKNOWN', 'PRT', 'PUNCT', 'ParseFromString', 'Person', 'Proper', 'REALIS', 'RECIPROCAL', 'RECIPROCITY_UNKNOWN', 'REFLEXIVE_CASE', 'REFLEXIVE_PERSON', 'RELATIVE_CASE', 'Reciprocity', 'RegisterExtension', 'SECOND', 'SHORT', 'SINGULAR', 'SPECIFIC', 'SUBJUNCTIVE', 'SerializePartialToString', 'SerializeToString', 'SetInParent', 'TENSE_UNKNOWN', 'THIRD', 'Tag', 'Tense', 'UNKNOWN', 'UnknownFields', 'VERB', 'VOCATIVE', 'VOICE_UNKNOWN', 'Voice', 'WhichOneof', 'X', '_CheckCalledFromGeneratedFile', '_SetListener', '__class__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', 
'_extensions_by_name', '_extensions_by_number', 'aspect', 'case', 'form', 'gender', 'mood', 'number', 'person', 'proper', 'reciprocity', 'tag', 'tense', 'voice']
    

    As you can see, the attributes of this object include every possible key and value of what looks like a dictionary: the lowercase names (like 'tag' or 'mood') are the fields, and the uppercase names (like 'VERB' or 'INDICATIVE') are the constants those fields can take. If you inspect attributes such as 'VERB' or 'tag', you will see that all of them are integers. The object stores information by setting each field to the integer of the matching constant. That is why 'tag' returns 11, which is precisely the integer associated with 'VERB' (you can check the same with 'mood' and 'INDICATIVE', both 3, and with 'tense' and 'PAST', also both 3). Fields that have no value set (like 'person' or 'gender') return 0. The methods in the listing, such as ListFields and SerializeToString, are the standard protocol buffer message methods, which indicates the object is a protobuf message.
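
    To make the integer-to-name relationship described above concrete, here is a minimal stand-in built with Python's own enum module. The Tag and Mood classes and the describe helper are hypothetical: they copy only the values quoted in this answer (plus 0 for the unset case) and are not the library's own classes, so treat this as an illustration of the idea rather than the library's API.

```python
from enum import IntEnum


# Hypothetical stand-ins: only the values quoted in the answer are reproduced.
class Tag(IntEnum):
    UNKNOWN = 0
    VERB = 11


class Mood(IntEnum):
    MOOD_UNKNOWN = 0
    INDICATIVE = 3


def describe(field_value, enum_cls):
    """Translate the integer stored in a field back to its constant name."""
    return enum_cls(field_value).name


print(describe(11, Tag))   # VERB
print(describe(3, Mood))   # INDICATIVE
print(describe(0, Tag))    # UNKNOWN (what an unset field looks like)
```

    The real message stores the same kind of integers, which is why comparing 'tag' with 'VERB' on the object gives equal values.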

    Now, coming back to a way of iterating over this object: the string you get when you print 'part_of_speech_0' has a YAML-like structure, so you can turn it into a dictionary by loading it with the yaml module in Python. This relies on the message printing as flat 'key: value' lines; nested messages print with braces and will not generally parse as YAML. Here is the complete code, which prints the (key, value) pairs in 'part_of_speech':

    from google.cloud import language
    from google.cloud.language import enums
    from google.cloud.language import types
    import yaml
    
    
    client = language.LanguageServiceClient()
    
    document = types.Document(content='Tried this', type=enums.Document.Type.PLAIN_TEXT)
    
    info = client.analyze_syntax(document=document)
    part_of_speech_0 = info.tokens[0].part_of_speech
    
    # Cast part_of_speech to a string and parse it as YAML into a dictionary.
    # safe_load is used because yaml.load without a Loader argument fails on recent PyYAML.
    part_0_yaml = yaml.safe_load(str(part_of_speech_0))
    
    # Iterate over the resulting dictionary.
    for key, value in part_0_yaml.items():
        print('key: {}, value: {}'.format(key, value))
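
    The YAML route above still goes through a string. As an alternative sketch, the ListFields method that appears in the dir() listing returns (field descriptor, value) pairs for the fields that are set, which can be collected into a dictionary directly. The helper name fields_to_dict is hypothetical, and this version assumes a flat message whose fields are plain scalars or single enums (as 'part_of_speech' is); it does not handle repeated or nested fields.

```python
def fields_to_dict(message):
    """Return {field_name: value} for the fields set on a flat protobuf message.

    Assumes `message` exposes the protobuf ListFields() method and that no
    field is repeated or itself a message (hypothetical helper, not library API).
    """
    result = {}
    for descriptor, value in message.ListFields():
        enum_type = descriptor.enum_type
        if enum_type is not None:
            # Enum fields hold integers; look up the symbolic constant name.
            value = enum_type.values_by_number[value].name
        result[descriptor.name] = value
    return result


# Usage with the objects from the answer (requires a live API response):
# for key, value in fields_to_dict(info.tokens[0].part_of_speech).items():
#     print('key: {}, value: {}'.format(key, value))
```

    For nested messages such as the whole response, the protobuf package's google.protobuf.json_format.MessageToDict is probably the more robust choice, since it walks nested and repeated fields as well.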