I'm basically trying to replicate the functionality of Google Places AutoComplete with ElasticSearch.
I have all places indexed on a single field, such as "Columbia, South Carolina 29044". The goal is to allow for autocomplete / typeahead functionality where, if the user types "Columbia, SC", "2904", or "Columbia, South Carolina", the user is presented with the aforementioned option (assuming matching options are sparse enough for it to show).
The most obvious problem I'm running into right now is that the synonym filter runs on the edge-ngram tokens, so partial terms match synonym rules and produce spurious expansions.
My index:
{
"settings": {
"analysis": {
"analyzer": {
"stateAnalyzer": {
"tokenizer": "autocomplete",
"filter": [
"lowercase",
"asciifolding",
"synonymFilter"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 30,
"token_chars": ["letter", "digit"]
}
},
"filter": {
"synonymFilter": {
"type": "synonym",
"synonyms": [
"Florida,FL",
"United States Virgin Islands,VI",
"Montana,MT",
"Minnesota,MN",
"Maryland,MD",
"South Carolina,SC",
"Maine,ME",
"Hawaii,HI",
"District of Columbia,DC",
"Commonwealth of the Northern Mariana Islands,MP",
"Rhode Island,RI",
"Nebraska,NE",
"Washington,WA",
"New Mexico,NM",
"Puerto Rico,PR",
"South Dakota,SD",
"Texas,TX",
"California,CA",
"Alabama,AL",
"Georgia,GA",
"Arkansas,AR",
"Pennsylvania,PA",
"Missouri,MO",
"Utah,UT",
"Oklahoma,OK",
"Tennessee,TN",
"Wyoming,WY",
"Indiana,IN",
"Kansas,KS",
"Idaho,ID",
"Alaska,AK",
"Nevada,NV",
"Illinois,IL",
"Vermont,VT",
"Connecticut,CT",
"New Jersey,NJ",
"North Dakota,ND",
"Iowa,IA",
"New Hampshire,NH",
"Arizona,AZ",
"Delaware,DE",
"Guam,GU",
"American Samoa,AS",
"Kentucky,KY",
"Ohio,OH",
"Wisconsin,WI",
"Oregon,OR",
"Mississippi,MS",
"Colorado,CO",
"North Carolina,NC",
"Virginia,VA",
"West Virginia,WV",
"Louisiana,LA",
"New York,NY",
"Michigan,MI",
"Massachusetts,MA"
],
"expand": true
}
}
}
},
"mappings": {
"properties": {
"fullName": {
"type": "text",
"analyzer": "stateAnalyzer",
"search_analyzer": "stateAnalyzer"
},
"route": {
"type": "text"
}
}
}
}
If I analyze that with the following:
{
"analyzer": "stateAnalyzer",
"text": "columbia SC"
}
It produces, amongst others:
{
"tokens" : [
{
"token" : "co",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "co",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "col",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "col",
"start_offset" : 0,
"end_offset" : 3,
"type" : "SYNONYM",
"position" : 1
},
{
"token" : "colu",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "colo",
"start_offset" : 0,
"end_offset" : 4,
"type" : "SYNONYM",
"position" : 2
},
{
"token" : "colum",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "color",
"start_offset" : 0,
"end_offset" : 5,
"type" : "SYNONYM",
"position" : 3
},
{
"token" : "columb",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 4
},
{
"token" : "colora",
"start_offset" : 0,
"end_offset" : 6,
"type" : "SYNONYM",
"position" : 4
},
{
"token" : "columbi",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 5
},
{
"token" : "colorad",
"start_offset" : 0,
"end_offset" : 7,
"type" : "SYNONYM",
"position" : 5
},
{
"token" : "columbia",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 6
},
{
"token" : "colorado",
"start_offset" : 0,
"end_offset" : 8,
"type" : "SYNONYM",
"position" : 6
},
{
"token" : "sc",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 7
},
{
"token" : "so",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 7
},
{
"token" : "sou",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 8
},
{
"token" : "sout",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 9
},
{
"token" : "south",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 10
},
{
"token" : "ca",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 11
},
{
"token" : "car",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 12
},
{
"token" : "caro",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 13
},
{
"token" : "carol",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 14
},
{
"token" : "caroli",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 15
},
{
"token" : "carolin",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 16
},
{
"token" : "carolina",
"start_offset" : 9,
"end_offset" : 11,
"type" : "SYNONYM",
"position" : 17
}
]
}
The issue seems to be that as Elasticsearch analyzes the text, it produces the edge-ngram "co", which matches the "Colorado,CO" synonym. However, I can't avoid this by raising the minimum gram size, because setting min_gram: 3 results in the error "term: FL was completely eliminated by analyzer".
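As an aside, the synonym token filter has a lenient flag that skips rules it cannot analyze instead of failing index creation. Note the caveat: with min_gram: 3, lenient mode would silently drop every two-letter abbreviation rule rather than fix the matching, so it only trades the error for missing synonyms. A sketch of the flag (synonym list abbreviated here):

```
"filter": {
  "synonymFilter": {
    "type": "synonym",
    "lenient": true,
    "synonyms": [
      "South Carolina,SC",
      "Colorado,CO"
    ]
  }
}
```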
I guess breaking up the address into parts and indexing each part as a completion field, rather than applying edge_ngram to each part, could resolve some of these issues. The challenge I have there is that I don't know how I'd get highlighting to work. I currently have:
{
  "highlight": {
    "fields": {
      "fullName": {
        "type": "plain"
      }
    }
  }
}
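For context, that highlight block sits inside a search request body; a minimal sketch (the query text is illustrative):

```
GET territories/_search
{
  "query": {
    "match": { "fullName": "Columbia SC" }
  },
  "highlight": {
    "fields": {
      "fullName": { "type": "plain" }
    }
  }
}
```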
Edit: copy-paste from Kibana:
DELETE territories
PUT territories
{
"settings": {
"analysis": {
"analyzer": {
"stateAnalyzer": {
"tokenizer": "autocomplete",
"filter": [
"asciifolding",
"lowercase",
"synonymFilter"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 30,
"token_chars": [
"letter",
"digit"
]
}
},
"filter": {
"synonymFilter": {
"type": "synonym",
"synonyms": [
"FL => Florida",
"VI => United States Virgin Islands",
"MT => Montana",
"MN => Minnesota",
"MD => Maryland",
"SC => South Carolina",
"ME => Maine",
"HI => Hawaii",
"DC => District of Columbia",
"MP => Commonwealth of the Northern Mariana Islands",
"RI => Rhode Island",
"NE => Nebraska",
"WA => Washington",
"NM => New Mexico",
"PR => Puerto Rico",
"SD => South Dakota",
"TX => Texas",
"CA => California",
"AL => Alabama",
"GA => Georgia",
"AR => Arkansas",
"PA => Pennsylvania",
"MO => Missouri",
"UT => Utah",
"OK => Oklahoma",
"TN => Tennessee",
"WY => Wyoming",
"IN => Indiana",
"KS => Kansas",
"ID => Idaho",
"AK => Alaska",
"NV => Nevada",
"IL => Illinois",
"VT => Vermont",
"CT => Connecticut",
"NJ => New Jersey",
"ND => North Dakota",
"IA => Iowa",
"NH => New Hampshire",
"AZ => Arizona",
"DE => Delaware",
"GU => Guam",
"AS => American Samoa",
"KY => Kentucky",
"OH => Ohio",
"WI => Wisconsin",
"OR => Oregon",
"MS => Mississippi",
"CO => Colorado",
"NC => North Carolina",
"VA => Virginia",
"WV => West Virginia",
"LA => Louisiana",
"NY => New York",
"MI => Michigan",
"MA => Massachusetts"
],
"expand": true
}
}
}
},
"mappings": {
"properties": {
"fullName": {
"type": "text",
"analyzer": "stateAnalyzer",
"search_analyzer": "stateAnalyzer"
},
"route": {
"type": "text"
}
}
}
}
POST territories/_analyze
{
"analyzer": "stateAnalyzer",
"text": "columbia SC"
}
Alright, I think we can achieve this if we re-order the analyzer a bit. If we postpone generating the edge n-grams until after synonym expansion, we ensure that we only n-gram the terms we are interested in auto-completing. "Columbia SC" will transform into ["columbia", "south", "carolina"] before edge-ngramming. "SC" will never make it into the inverted index, only the fully qualified terms, even though "SC" is still searchable.
Here is your updated analyzer:
PUT territories
{
"settings": {
"analysis": {
"analyzer": {
"stateAnalyzer": {
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase",
"synonymFilter",
"edge_ngram_filter"
]
}
},
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 5,
"preserve_original": true
},
"synonymFilter": {
"type": "synonym",
"synonyms": [
"FL => Florida",
"VI => United States Virgin Islands",
"MT => Montana",
"MN => Minnesota",
"MD => Maryland",
"SC => South Carolina",
"ME => Maine",
"HI => Hawaii",
"DC => District of Columbia",
"MP => Commonwealth of the Northern Mariana Islands",
"RI => Rhode Island",
"NE => Nebraska",
"WA => Washington",
"NM => New Mexico",
"PR => Puerto Rico",
"SD => South Dakota",
"TX => Texas",
"CA => California",
"AL => Alabama",
"GA => Georgia",
"AR => Arkansas",
"PA => Pennsylvania",
"MO => Missouri",
"UT => Utah",
"OK => Oklahoma",
"TN => Tennessee",
"WY => Wyoming",
"IN => Indiana",
"KS => Kansas",
"ID => Idaho",
"AK => Alaska",
"NV => Nevada",
"IL => Illinois",
"VT => Vermont",
"CT => Connecticut",
"NJ => New Jersey",
"ND => North Dakota",
"IA => Iowa",
"NH => New Hampshire",
"AZ => Arizona",
"DE => Delaware",
"GU => Guam",
"AS => American Samoa",
"KY => Kentucky",
"OH => Ohio",
"WI => Wisconsin",
"OR => Oregon",
"MS => Mississippi",
"CO => Colorado",
"NC => North Carolina",
"VA => Virginia",
"WV => West Virginia",
"LA => Louisiana",
"NY => New York",
"MI => Michigan",
"MA => Massachusetts"
],
"expand": true
}
}
}
},
"mappings": {
"properties": {
"fullName": {
"type": "text",
"analyzer": "stateAnalyzer",
"search_analyzer": "stateAnalyzer"
},
"route": {
"type": "text"
}
}
}
}
One issue with switching from the edge n-gram tokenizer to the token filter is that it is no longer possible to use Elasticsearch's highlighting to highlight just the matched prefix of a word; it will always highlight the entire word (see discussion here).
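One common mitigation (a sketch based on standard practice, not tested against this data; the analyzer name stateSearchAnalyzer is my own) is to n-gram only at index time and use a plain analyzer at search time, so a typed prefix like "colum" matches the indexed n-grams directly instead of being n-grammed again. The first fragment goes under settings.analysis.analyzer, the second under mappings.properties:

```
"stateSearchAnalyzer": {
  "tokenizer": "standard",
  "filter": ["asciifolding", "lowercase", "synonymFilter"]
}

"fullName": {
  "type": "text",
  "analyzer": "stateAnalyzer",
  "search_analyzer": "stateSearchAnalyzer"
}
```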
If you are interested in auto-complete, suggesters are probably where you are going to end up. Here is what a sample query and its output might look like using the above analyzer.
Index a couple of documents, then query:
POST territories/_doc/
{
"fullName": "Columbia, South Carolina 29044"
}
POST territories/_doc/
{
"fullName": "Myrtle Beach, South Carolina 90210"
}
GET territories/_search
{
"query" : {
"match": {
"fullName": "Columbia SC"
}
},
"suggest" : {
"my-suggestion" : {
"text" : "Columbia SC",
"term" : {
"field" : "fullName"
}
}
}
}
Query output:
...
"hits" : [
{
"_index" : "territories",
"_type" : "_doc",
"_id" : "6LyxTnMBxDBOJM21waus",
"_score" : 2.1154594,
"_source" : {
"fullName" : "Columbia, South Carolina 29044"
}
},
{
"_index" : "territories",
"_type" : "_doc",
"_id" : "ury0TnMBxDBOJM21VrAj",
"_score" : 0.7175633,
"_source" : {
"fullName" : "Myrtle Beach, South Carolina 90210"
}
}
]
},
"suggest" : {
"my-suggestion" : [
{
"text" : "co",
"offset" : 0,
"length" : 8,
"options" : [ ]
},
{
"text" : "col",
"offset" : 0,
"length" : 8,
"options" : [ ]
},
{
"text" : "colu",
"offset" : 0,
"length" : 8,
"options" : [ ]
},
{
"text" : "colum",
"offset" : 0,
"length" : 8,
"options" : [ ]
},
{
"text" : "columbia",
"offset" : 0,
"length" : 8,
"options" : [ ]
},
{
"text" : "so",
"offset" : 9,
"length" : 2,
"options" : [ ]
},
{
"text" : "sou",
"offset" : 9,
"length" : 2,
"options" : [ ]
},
{
"text" : "sout",
"offset" : 9,
"length" : 2,
"options" : [ ]
},
{
"text" : "south",
"offset" : 9,
"length" : 2,
"options" : [ ]
},
{
"text" : "ca",
"offset" : 9,
"length" : 2,
"options" : [ ]
},
{
"text" : "car",
"offset" : 9,
"length" : 2,
"options" : [ ]
},
{
"text" : "caro",
"offset" : 9,
"length" : 2,
"options" : [ ]
},
{
"text" : "carol",
"offset" : 9,
"length" : 2,
"options" : [ ]
},
{
"text" : "carolina",
"offset" : 9,
"length" : 2,
"options" : [ ]
}
]
...
You can see the effective equivalent of your _analyze endpoint there, under the suggest field.
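If you do head toward suggesters, the completion field type is the usual starting point; a minimal sketch (the index name, field name, and inputs are illustrative, and each variant you want matchable must be supplied explicitly as an input):

```
PUT territories_suggest
{
  "mappings": {
    "properties": {
      "suggest": { "type": "completion" }
    }
  }
}

POST territories_suggest/_doc
{
  "suggest": ["Columbia, South Carolina 29044", "Columbia, SC"]
}

GET territories_suggest/_search
{
  "suggest": {
    "place-suggest": {
      "prefix": "Columbia, S",
      "completion": { "field": "suggest" }
    }
  }
}
```

The trade-off is that completion suggesters match prefixes of the stored inputs only, which is why the full-text edge n-gram approach above is attractive for mid-string matches like a ZIP code.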