I am using Google's Natural Language analyzeEntities API, and in the response there is a nested EntityMention.TextSpan object with two fields: content and beginOffset.
I want to use beginOffset for some further analysis, so I tried to compute the index of each word in the original text and compare it with beginOffset, but I noticed the indexes were different.
I am using a fairly naive approach to build this index:
const msg = "it will cost you $350 - $600,. test. Alexander. How are you?"
let index = 0
// Print each space-separated part together with its start index in msg
msg.split(" ").forEach(part => {
  console.log(part + ":" + index)
  index = index + part.length + 1 // + 1 for the space removed by split
})
The results are:
it:0
will:3
cost:8
you:13
$350:17
-:22
$600,.:24
test.:31
Alexander.:37
How:48
are:52
you?:56
The results I get from the analyzeEntities API are:
gcloud ml language analyze-entities --content="it will cost you $350 - $600,. test. Alexander. How are you?"
{
  "entities": [
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 23,
            "content": "test"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "test",
      "salience": 0.7828024,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 29,
            "content": "Alexander"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "Alexander",
      "salience": 0.2171976,
      "type": "PERSON"
    }
  ],
  "language": "en"
}
I understand that non-alphanumeric characters may get special handling, but I was expecting the offset to represent the true index.
Since it does not, what rules are used to parse the query text, and how is beginOffset calculated?
Thanks!
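It looks like the $ sign is the problem here. Inside double quotes the shell expands $350 and $600 as parameters before gcloud receives the text, so the API analyzes a shorter string than the one you indexed. Escaping the $ gives the offsets you expected: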
gcloud ml language analyze-entities --content="it will cost you \$350 - \$600,. test. Alexander. How are you?"
{
  "entities": [
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 31,
            "content": "test"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "test",
      "salience": 0.7828024,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 37,
            "content": "Alexander"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "Alexander",
      "salience": 0.2171976,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 17,
            "content": "$350"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "currency": "USD",
        "value": "350.000000"
      },
      "name": "$350",
      "salience": 0.0,
      "type": "PRICE"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 24,
            "content": "$600"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "currency": "USD",
        "value": "600.000000"
      },
      "name": "$600",
      "salience": 0.0,
      "type": "PRICE"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 18,
            "content": "350"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "value": "350"
      },
      "name": "350",
      "salience": 0.0,
      "type": "NUMBER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 25,
            "content": "600"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "value": "600"
      },
      "name": "600",
      "salience": 0.0,
      "type": "NUMBER"
    }
  ],
  "language": "en"
}
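The offsets in the question (23 and 29) are each 8 lower than the ones above (31 and 37), which is what you would get if $350 and $600 were both replaced by empty strings. Here is a minimal sketch of that arithmetic. The helper simulateUnsetExpansion is hypothetical and only imitates that one outcome; real shells differ in how they parse $ followed by digits, so treat it as an illustration rather than a model of any particular shell.

```javascript
// Hypothetical helper: imitates a shell that replaces an unset $NNN
// parameter with an empty string inside double quotes (assumption).
function simulateUnsetExpansion(text) {
  return text.replace(/\$\d+/g, "");
}

const original = "it will cost you $350 - $600,. test. Alexander. How are you?";
const received = simulateUnsetExpansion(original);

// The string the API would have analyzed under this assumption
console.log(JSON.stringify(received));

// Offsets in the original text vs. the expanded text
console.log(original.indexOf("test"), received.indexOf("test"));           // 31 23
console.log(original.indexOf("Alexander"), received.indexOf("Alexander")); // 37 29
```

So beginOffset is a plain index into whatever text actually reached the API; no special parsing rule for punctuation is needed to explain the difference.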
If you change the $ sign to #, which the shell does not expand inside double quotes, it also seems to work as expected.
gcloud ml language analyze-entities --content="it will cost you #350 - #600,. test. Alexander. How are you?"
{
  "entities": [
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 31,
            "content": "test"
          },
          "type": "COMMON"
        }
      ],
      "metadata": {},
      "name": "test",
      "salience": 0.9085014,
      "type": "OTHER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 37,
            "content": "Alexander"
          },
          "type": "PROPER"
        }
      ],
      "metadata": {},
      "name": "Alexander",
      "salience": 0.09149864,
      "type": "PERSON"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 18,
            "content": "350"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "value": "350"
      },
      "name": "350",
      "salience": 0.0,
      "type": "NUMBER"
    },
    {
      "mentions": [
        {
          "text": {
            "beginOffset": 25,
            "content": "600"
          },
          "type": "TYPE_UNKNOWN"
        }
      ],
      "metadata": {
        "value": "600"
      },
      "name": "600",
      "salience": 0.0,
      "type": "NUMBER"
    }
  ],
  "language": "en"
}
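If you want to guard against this kind of mismatch in your own code, one option is to check that each mention's content really sits at its beginOffset in the string you think you sent. The sketch below uses the mentions from the escaped (\$) response; the helper name mentionMatches is made up for illustration, and it assumes the offsets line up with JavaScript string indices, which holds for ASCII-only text like this sample.

```javascript
const text = "it will cost you $350 - $600,. test. Alexander. How are you?";

// Mentions copied from the escaped (\$) response
const mentions = [
  { content: "test", beginOffset: 31 },
  { content: "Alexander", beginOffset: 37 },
  { content: "$350", beginOffset: 17 },
  { content: "$600", beginOffset: 24 },
  { content: "350", beginOffset: 18 },
  { content: "600", beginOffset: 25 },
];

// Hypothetical helper: true when the mention's content is found
// in sourceText exactly at the reported beginOffset.
function mentionMatches(sourceText, mention) {
  return sourceText.startsWith(mention.content, mention.beginOffset);
}

const mismatches = mentions.filter(m => !mentionMatches(text, m));
console.log(mismatches.length === 0 ? "all offsets match" : mismatches);
```

Running the same check with the offsets from the unescaped command (23 for test, 29 for Alexander) reports a mismatch, which is a quick signal that the text the API saw is not the text you indexed.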