If I call the analyze_syntax function from the Google Cloud Python library, this is what it returns:
from google.cloud import language
from google.cloud.language import enums, types
client = language.LanguageServiceClient()
document = types.Document(content='Tried this', type=enums.Document.Type.PLAIN_TEXT)
info = client.analyze_syntax(document=document)
print(info)
sentences {
  text {
    content: "Tried this"
    begin_offset: -1
  }
}
tokens {
  text {
    content: "Tried"
    begin_offset: -1
  }
  part_of_speech {
    tag: VERB
    mood: INDICATIVE
    tense: PAST
  }
  dependency_edge {
    label: ROOT
  }
  lemma: "try"
}
tokens {
  text {
    content: "this"
    begin_offset: -1
  }
  part_of_speech {
    tag: DET
    number: SINGULAR
  }
  dependency_edge {
    label: DOBJ
  }
  lemma: "this"
}
language: "en"
print(info.tokens)
[text {
  content: "Tried"
  begin_offset: -1
}
part_of_speech {
  tag: VERB
  mood: INDICATIVE
  tense: PAST
}
dependency_edge {
  label: ROOT
}
lemma: "try"
, text {
  content: "this"
  begin_offset: -1
}
part_of_speech {
  tag: DET
  number: SINGULAR
}
dependency_edge {
  label: DOBJ
}
lemma: "this"
]
print(info.tokens[0].part_of_speech)
tag: VERB
mood: INDICATIVE
tense: PAST
This is a weird format to me, because:

I can't iterate by (what look like) keys: for key in info.tokens[0].part_of_speech: gives TypeError: 'PartOfSpeech' object is not iterable.

Accessing the values doesn't work the way I thought either: info.tokens[0].part_of_speech.tag gives the value 11.

QUESTION: What type of object is that and how does it work?
I want to be able to convert it to a dictionary (in a better way than converting it to a string first) or to iterate through it somehow (find which keys it has and their corresponding values).
The first thing you can do to get the type of an object in Python is to call the built-in function type():
part_of_speech_0 = info.tokens[0].part_of_speech
print(type(part_of_speech_0))
which you will see returns the following output:
<class 'google.cloud.language_v1.types.PartOfSpeech'>
That is, a class defined in the Google Cloud NLP library itself.
You can also see what attributes this class has by using the built-in function dir():
print(dir(part_of_speech_0))
That results in:
['ACCUSATIVE', 'ACTIVE', 'ADJ', 'ADNOMIAL', 'ADP', 'ADV', 'ADVERBIAL', 'AFFIX', 'ASPECT_UNKNOWN', 'AUXILIARY', 'Aspect', 'ByteSize', 'CASE_UNKNOWN', 'CAUSATIVE', 'COMPLEMENTIVE', 'COMPLEMENTIZER', 'CONDITIONAL_MOOD', 'CONDITIONAL_TENSE', 'CONJ', 'Case', 'Clear', 'ClearExtension', 'ClearField', 'CopyFrom', 'DATIVE', 'DESCRIPTOR', 'DET', 'DUAL', 'DiscardUnknownFields', 'Extensions', 'FEMININE', 'FINAL_ENDING', 'FIRST', 'FORM_UNKNOWN', 'FUTURE', 'FindInitializationErrors', 'Form', 'FromString', 'GENDER_UNKNOWN', 'GENITIVE', 'GERUND', 'Gender', 'HasExtension', 'HasField', 'IMPERATIVE', 'IMPERFECT', 'IMPERFECTIVE', 'INDICATIVE', 'INSTRUMENTAL', 'INTERROGATIVE', 'IRREALIS', 'IsInitialized', 'JUSSIVE', 'LOCATIVE', 'LONG', 'ListFields', 'MASCULINE', 'MOOD_UNKNOWN', 'MergeFrom', 'MergeFromString', 'Mood', 'NEUTER', 'NOMINATIVE', 'NON_RECIPROCAL', 'NOT_PROPER', 'NOUN', 'NUM', 'NUMBER_UNKNOWN', 'Number', 'OBLIQUE', 'ORDER', 'PARTITIVE', 'PASSIVE', 'PAST', 'PERFECTIVE', 'PERSON_UNKNOWN', 'PLUPERFECT', 'PLURAL', 'PREPOSITIONAL', 'PRESENT', 'PROGRESSIVE', 'PRON', 'PROPER', 'PROPER_UNKNOWN', 'PRT', 'PUNCT', 'ParseFromString', 'Person', 'Proper', 'REALIS', 'RECIPROCAL', 'RECIPROCITY_UNKNOWN', 'REFLEXIVE_CASE', 'REFLEXIVE_PERSON', 'RELATIVE_CASE', 'Reciprocity', 'RegisterExtension', 'SECOND', 'SHORT', 'SINGULAR', 'SPECIFIC', 'SUBJUNCTIVE', 'SerializePartialToString', 'SerializeToString', 'SetInParent', 'TENSE_UNKNOWN', 'THIRD', 'Tag', 'Tense', 'UNKNOWN', 'UnknownFields', 'VERB', 'VOCATIVE', 'VOICE_UNKNOWN', 'Voice', 'WhichOneof', 'X', '_CheckCalledFromGeneratedFile', '_SetListener', '__class__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '_extensions_by_name', '_extensions_by_number', 'aspect', 'case', 'form', 'gender', 'mood', 'number', 'person', 'proper', 'reciprocity', 'tag', 'tense', 'voice']
As you can see, this object has as attributes both all the possible keys and all the possible values of what looks like a dictionary. If you further inspect attributes like 'VERB' or 'tag', you will see that all of them are integers: this class is a protocol buffer message (which is also why dir() shows methods like ByteSize, SerializeToString or ListFields), and it stores information by matching each key to the integer code of its value. That is why 'tag' returns 11, because that is precisely the integer associated with 'VERB' (you can check this also with 'mood' and 'INDICATIVE', which are both 3, and with 'tense' and 'PAST', also both 3). Conversely, the keys that don't have an associated value (like 'person' or 'gender') are left at the default value 0.
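If you want to turn those integers back into their human-readable names, the enums module (the same one imported in the complete code below) exposes the corresponding enum classes. This is only a small sketch of that idea, assuming the pre-2.0 google-cloud-language client that produced the output above:

from google.cloud.language import enums

# Each nested enum class maps names to integers,
# e.g. Tag.VERB == 11, Mood.INDICATIVE == 3, Tense.PAST == 3
tag_value = info.tokens[0].part_of_speech.tag        # 11
print(enums.PartOfSpeech.Tag(tag_value).name)        # VERB
print(enums.PartOfSpeech.Mood(info.tokens[0].part_of_speech.mood).name)  # INDICATIVE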
Now, coming back to a way of iterating over this item: the string returned when you print part_of_speech_0 has a YAML-like structure, so you can turn it into a dictionary by loading it with the yaml module. Here is the final, complete code that would print the (key, value) pairs in part_of_speech:
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import yaml

client = language.LanguageServiceClient()
document = types.Document(content='Tried this', type=enums.Document.Type.PLAIN_TEXT)
info = client.analyze_syntax(document=document)
part_of_speech_0 = info.tokens[0].part_of_speech

# Cast part_of_speech into a string and load it into a dictionary,
# relying on its YAML-like structure
part_0_yaml = yaml.safe_load(str(part_of_speech_0))

# Iterate over the resulting dictionary
for key, value in part_0_yaml.items():
    print('key: {}, value: {}'.format(key, value))
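As a side note, and only as a sketch of an alternative approach (it is not required for the solution above): since these objects are protocol buffer messages, the protobuf package installed alongside the client library can convert them into a dictionary directly, without going through a string first:

from google.protobuf.json_format import MessageToDict

# Converts the protobuf message into a plain Python dictionary;
# enum fields come back as their names (e.g. 'VERB') and fields
# left at their default value 0 are omitted by default.
part_0_dict = MessageToDict(part_of_speech_0)
for key, value in part_0_dict.items():
    print('key: {}, value: {}'.format(key, value))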