I have a mongo collection that stores city/country data in multiple languages. For example, the following query:
db.cities_database.find({ "name.pl.country": "Węgry" }).pretty().limit(10);
returns data in the following format:
[
  {
    _id: ObjectId('67331d2a9566994a18c505aa'),
    geoname_id_city: 714073,
    latitude: 46.91667,
    longitude: 21.26667,
    geohash: 'u2r4guvvmm4m',
    country_code: 'HU',
    population: 7494,
    estimated_radius: 400,
    feature_code: 'PPL',
    name: {
      pl: { city: 'Veszto', admin1: null, country: 'Węgry' },
      ascii: { city: 'veszto', admin1: null, country: null },
      lt: { city: 'Veszto', admin1: null, country: 'Vengrija' },
      ru: { city: 'Veszto', admin1: null, country: 'Венгрия' },
      hu: { city: 'Veszto', admin1: null, country: 'Magyarország' },
      en: { city: 'Veszto', admin1: null, country: 'Hungary' },
      fr: { city: 'Veszto', admin1: null, country: 'Hongrie' }
    }
  }
  ...
]
I want to be able to run the same query using only English (ASCII) characters, so for this example I'd like to query by "name.pl.country": "Wegry" (that is, I'd like Mongo to treat the character ę as e while performing this query).
Is it possible to achieve this?
So far I tried using collation like this:
db.cities_database.find({ "name.pl.country": "Wegry" }).collation({ locale: "pl", strength: 1 }).pretty().limit(10);
but this query doesn't return anything.
I don't know Polish, so I can't speak to the linguistic difference between e and ę. But if you use MongoDB Atlas, you can set up a custom analyzer with the icuFolding token filter to perform a diacritic-insensitive search.
The index:
{
  "analyzer": "diacriticFolder",
  "mappings": {
    "fields": {
      "name": {
        "type": "document",
        "fields": {
          "pl": {
            "type": "document",
            "fields": {
              "country": {
                "analyzer": "diacriticFolder",
                "type": "string"
              }
            }
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "diacriticFolder",
      "charFilters": [],
      "tokenizer": {
        "type": "keyword"
      },
      "tokenFilters": [
        {
          "type": "icuFolding"
        }
      ]
    }
  ]
}
The $search query:
[
  {
    $search: {
      "text": {
        "query": "Wegry",
        "path": "name.pl.country"
      }
    }
  }
]
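To get a feel for why this matches, here is a small JavaScript sketch of the kind of folding that is applied to both the indexed value and the query string. It is only an approximation and not Atlas's actual implementation: foldDiacritics is a hypothetical helper name, and it uses Unicode NFD normalization plus lowercasing as a stand-in for the icuFolding filter, which does more than this.

```javascript
// Rough stand-in for diacritic folding (assumption: NFD decomposition plus
// removal of combining marks; the real icuFolding filter covers more cases).
function foldDiacritics(text) {
  return text
    .normalize("NFD")        // split "ę" into "e" + combining ogonek
    .replace(/\p{M}/gu, "")  // drop the combining marks
    .toLowerCase();          // fold case as well
}

// The stored value and the ASCII query fold to the same token,
// which is why a search for "Wegry" can find "Węgry".
const indexedToken = foldDiacritics("Węgry");
const queryToken = foldDiacritics("Wegry");
console.log(indexedToken === queryToken);
```

Note that this sketch leaves letters with no Unicode decomposition (for example the Polish ł) untouched, so don't rely on it as a replacement for the analyzer.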