
How is beginOffset calculated in the Google Natural Language analyzeEntities API response?

I am using Google's Natural Language analyzeEntities API. The response contains a nested EntityMention.TextSpan object with two fields: content and beginOffset. I want to use beginOffset for some further analysis, so I tried to compute the index of each word in the original text and compare it with beginOffset, but I noticed the indexes were different.

I am using a fairly naive approach to build this index:

const msg = "it will cost you $350 - $600,. test. Alexander. How are you?"
let index = 0
msg.split(" ").forEach(part => {
  console.log(part + ":" + index)
  index = index + part.length + 1 // + 1 for the space removed by split
})

The results are:

it:0
will:3
cost:8
you:13
$350:17
-:22
$600,.:24
test.:31
Alexander.:37
How:48
are:52
you?:56

The results I get from the analyzeEntities API are:

gcloud ml language analyze-entities --content="it will cost you $350 - $600,. test. Alexander. How are you?"
{
  "entities": [
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 23,
            "content": "test"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "test",
      "salience": 0.7828024,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 29,
            "content": "Alexander"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "Alexander",
      "salience": 0.2171976,
      "type": "PERSON"
    }
  ],
  "language": "en"
}

I understand that non-alphanumeric characters have special meaning and handling, but I was expecting the offset to represent the true index in the text.

Since it does not, what rules are used to parse the query text, and how is beginOffset calculated?



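  • It looks like the $ sign is the problem here. Inside double quotes the shell treats $350 and $600 as parameter expansions and substitutes them before gcloud receives the text, so the API analyzes a shorter string than the one you indexed. Your offsets for "test" (31 vs. 23) differ by 8 characters, which is consistent with both 4-character tokens having been expanded to nothing. Escaping the $ signs gives the expected offsets: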

    gcloud ml language analyze-entities --content="it will cost you \$350 - \$600,. test. Alexander. How are you?"
    {
      "entities": [
        {
          "mentions": [
            {
              "text": {
                "beginOffset": 31,
                "content": "test"
              },
              "type": "COMMON"
            }
          ],
          "metadata": {},
          "name": "test",
          "salience": 0.7828024,
          "type": "OTHER"
        },
        {
          "mentions": [
            {
              "text": {
                "beginOffset": 37,
                "content": "Alexander"
              },
              "type": "PROPER"
            }
          ],
          "metadata": {},
          "name": "Alexander",
          "salience": 0.2171976,
          "type": "PERSON"
        },
        {
          "mentions": [
            {
              "text": {
                "beginOffset": 17,
                "content": "$350"
              },
              "type": "TYPE_UNKNOWN"
            }
          ],
          "metadata": {
            "currency": "USD",
            "value": "350.000000"
          },
          "name": "$350",
          "salience": 0.0,
          "type": "PRICE"
        },
        {
          "mentions": [
            {
              "text": {
                "beginOffset": 24,
                "content": "$600"
              },
              "type": "TYPE_UNKNOWN"
            }
          ],
          "metadata": {
            "currency": "USD",
            "value": "600.000000"
          },
          "name": "$600",
          "salience": 0.0,
          "type": "PRICE"
        },
        {
          "mentions": [
            {
              "text": {
                "beginOffset": 18,
                "content": "350"
              },
              "type": "TYPE_UNKNOWN"
            }
          ],
          "metadata": {
            "value": "350"
          },
          "name": "350",
          "salience": 0.0,
          "type": "NUMBER"
        },
        {
          "mentions": [
            {
              "text": {
                "beginOffset": 25,
                "content": "600"
              },
              "type": "TYPE_UNKNOWN"
            }
          ],
          "metadata": {
            "value": "600"
          },
          "name": "600",
          "salience": 0.0,
          "type": "NUMBER"
        }
      ],
      "language": "en"
    }

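    If you change the $ sign to #, which the shell does not expand inside double quotes, the offsets are also as expected (although the amounts are then no longer reported as PRICE entities):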

    gcloud ml language analyze-entities --content="it will cost you #350 - #600,. test. Alexander. How are you?"
    {
      "entities": [
        {
          "mentions": [
            {
              "text": {
                "beginOffset": 31,
                "content": "test"
              },
              "type": "COMMON"
            }
          ],
          "metadata": {},
          "name": "test",
          "salience": 0.9085014,
          "type": "OTHER"
        },
        {
          "mentions": [
            {
              "text": {
                "beginOffset": 37,
                "content": "Alexander"
              },
              "type": "PROPER"
            }
          ],
          "metadata": {},
          "name": "Alexander",
          "salience": 0.09149864,
          "type": "PERSON"
        },
        {
          "mentions": [
            {
              "text": {
                "beginOffset": 18,
                "content": "350"
              },
              "type": "TYPE_UNKNOWN"
            }
          ],
          "metadata": {
            "value": "350"
          },
          "name": "350",
          "salience": 0.0,
          "type": "NUMBER"
        },
        {
          "mentions": [
            {
              "text": {
                "beginOffset": 25,
                "content": "600"
              },
              "type": "TYPE_UNKNOWN"
            }
          ],
          "metadata": {
            "value": "600"
          },
          "name": "600",
          "salience": 0.0,
          "type": "NUMBER"
        }
      ],
      "language": "en"
    }