
Google Cloud Natural Language API: Does analyzeSyntax have a parameter to omit "*_UNKNOWN" attribute values from partOfSpeech in the response?


Is there any way to have the analyzeSyntax endpoint omit sub-attributes of the partOfSpeech dictionaries from the response JSON when their value is *_UNKNOWN? Looking at the documentation for the document input, I can't find any way to limit the contents of partOfSpeech in the response.

Is this something that will only be handled when cleaning the data, post-response?

Example query, per the API docs, saved in a file called request.json:

{
  "encodingType": "UTF8",
  "document": {
    "type": "PLAIN_TEXT",
    "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.  Sundar Pichai said in his keynote that users love their new Android phones."
  }
}

Command executed:

curl "https://language.googleapis.com/v1/documents:analyzeSyntax?key=${API_KEY}" \
  -s \
  -X POST \
  -H "Content-Type: application/json" \
  --data-binary @request.json > response.json

Sample of response:

{
  "sentences": [
    {
      "text": {
        "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.",
        "beginOffset": 0
      }
    },
    {
      "text": {
        "content": "Sundar Pichai said in his keynote that users love their new Android phones.",
        "beginOffset": 105
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Google",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 7,
        "label": "NSUBJ"
      },
      "lemma": "Google"
    },
    {
      "text": {
        "content": ",",
        "beginOffset": 6
      },
      "partOfSpeech": {
        "tag": "PUNCT",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "NUMBER_UNKNOWN",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER_UNKNOWN",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 0,
        "label": "P"
      },
      "lemma": ","
    },
...
...

The response JSON is 819 lines long, and 314 of those lines (nearly 40% of the response!) are *_UNKNOWN values for partOfSpeech attributes. They carry no useful information, yet they add significantly to the amount of data in a response.

The documentation doesn't seem to provide parameters that could help with this. Am I missing something, or does this API not support an argument for dropping those keys when they are *_UNKNOWN? Is this something that can only be managed post-response with data cleaning?


Solution

  • If we look at the API specification, we find that the partOfSpeech attributes are enums (enumerations). For example, Gender can be one of GENDER_UNKNOWN, FEMININE, MASCULINE, or NEUTER.

    REST API calls send and receive JSON payloads, and JSON represents enum values as their full string names. However, REST with JSON is not the only way to make GCP service requests. One can also make gRPC calls, in which case the payload on the wire is a protocol buffer. Google provides client libraries for various languages that let you make service calls over gRPC without having to get distracted with learning that technology. The benefit of gRPC is that the messages are much smaller and faster to transmit.
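To make the size difference concrete, here is a rough Python sketch that compares the JSON form of the first token's partOfSpeech with a hand-rolled protobuf-style encoding. The field numbers and enum values in it are assumptions made for illustration (check the service's .proto definitions for the real ones), and encode_enum_fields is a hypothetical helper, not part of any Google library. It relies on the proto3 rule that a field holding its default (zero) value is not written to the wire, and it assumes that every *_UNKNOWN value is that zero default.

```python
import json

# Assumed field numbers and enum values, for illustration only.
FIELD_NUMBERS = {
    "tag": 1, "aspect": 2, "case": 3, "form": 4, "gender": 5, "mood": 6,
    "number": 7, "person": 8, "proper": 9, "reciprocity": 10, "tense": 11,
    "voice": 12,
}
# Any value not listed here (i.e. the *_UNKNOWN ones) is treated as 0.
ENUM_VALUES = {"NOUN": 6, "SINGULAR": 1, "PROPER": 1}


def encode_enum_fields(part_of_speech):
    """Encode enum fields as proto3 would: one key byte plus one varint byte,
    skipping every field that holds the default (zero) value."""
    out = bytearray()
    for name, value in part_of_speech.items():
        number = ENUM_VALUES.get(value, 0)
        if number == 0:
            continue  # default values are not serialized
        # Key byte = (field number << 3) | wire type 0 (varint).
        # A single byte is enough for field numbers below 16.
        out.append(FIELD_NUMBERS[name] << 3)
        out.append(number)  # a single byte is enough for values below 128
    return bytes(out)


# partOfSpeech of the first token ("Google") from the sample response.
part_of_speech = {
    "tag": "NOUN",
    "aspect": "ASPECT_UNKNOWN",
    "case": "CASE_UNKNOWN",
    "form": "FORM_UNKNOWN",
    "gender": "GENDER_UNKNOWN",
    "mood": "MOOD_UNKNOWN",
    "number": "SINGULAR",
    "person": "PERSON_UNKNOWN",
    "proper": "PROPER",
    "reciprocity": "RECIPROCITY_UNKNOWN",
    "tense": "TENSE_UNKNOWN",
    "voice": "VOICE_UNKNOWN",
}

json_size = len(json.dumps(part_of_speech).encode("utf-8"))
proto_size = len(encode_enum_fields(part_of_speech))
print(f"JSON: {json_size} bytes, protobuf-style: {proto_size} bytes")
```

Under these assumptions only tag, number, and proper are encoded, at two bytes each, while the nine *_UNKNOWN attributes take no space at all.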

    I have not seen any mechanism for reducing the size of the payload at the API level (such as asking for fields to be left out of the JSON response when using REST).
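If you stay with REST, that leaves cleaning the data after the response arrives. A minimal sketch in Python, assuming the response has the shape shown in the question; strip_unknown is a hypothetical helper name, and matching on the _UNKNOWN suffix is my assumption about which values you want dropped:

```python
import json


def strip_unknown(response):
    """Return a copy of an analyzeSyntax response in which partOfSpeech
    attributes whose value ends in "_UNKNOWN" have been removed."""
    cleaned = dict(response)
    cleaned["tokens"] = []
    for token in response.get("tokens", []):
        token = dict(token)  # shallow copy, so the input is left untouched
        token["partOfSpeech"] = {
            key: value
            for key, value in token.get("partOfSpeech", {}).items()
            if not (isinstance(value, str) and value.endswith("_UNKNOWN"))
        }
        cleaned["tokens"].append(token)
    return cleaned


# A cut-down response containing only the second token (",") from the sample.
sample = {
    "tokens": [
        {
            "text": {"content": ",", "beginOffset": 6},
            "partOfSpeech": {
                "tag": "PUNCT",
                "aspect": "ASPECT_UNKNOWN",
                "number": "NUMBER_UNKNOWN",
                "proper": "PROPER_UNKNOWN",
            },
            "dependencyEdge": {"headTokenIndex": 0, "label": "P"},
            "lemma": ",",
        }
    ]
}

cleaned = strip_unknown(sample)
print(json.dumps(cleaned, indent=2))
```

To run this on real output, load response.json with json.load and pass the result to strip_unknown. Any code that reads the cleaned data then has to treat a missing key as "unknown".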
