Mongo query to ignore non-English characters


I have a mongo collection that stores city/country data in multiple languages. For example, the following query:

db.cities_database.find({ "name.pl.country": "Węgry" }).pretty().limit(10);

Returns data in the following format:

[
  {
    _id: ObjectId('67331d2a9566994a18c505aa'),
    geoname_id_city: 714073,
    latitude: 46.91667,
    longitude: 21.26667,
    geohash: 'u2r4guvvmm4m',
    country_code: 'HU',
    population: 7494,
    estimated_radius: 400,
    feature_code: 'PPL',
    name: {
      pl: { city: 'Veszto', admin1: null, country: 'Węgry' },
      ascii: { city: 'veszto', admin1: null, country: null },
      lt: { city: 'Veszto', admin1: null, country: 'Vengrija' },
      ru: { city: 'Veszto', admin1: null, country: 'Венгрия' },
      hu: { city: 'Veszto', admin1: null, country: 'Magyarország' },
      en: { city: 'Veszto', admin1: null, country: 'Hungary' },
      fr: { city: 'Veszto', admin1: null, country: 'Hongrie' }
    }
  }
...
]

I want to be able to run the same query using only English (ASCII) characters. For this example, I'd like to query by "name.pl.country": "Wegry", so that Mongo treats the character ę as e when matching.

Is it possible to achieve this?

So far I have tried using a collation like this:

db.cities_database.find({ "name.pl.country": "Wegry" }).collation({ locale: "pl", strength: 1 }).pretty().limit(10);

but this query doesn't return anything.


Solution

  • I don't know Polish, so I can't speak to the difference between e and ę. But if you use MongoDB Atlas, you can define a custom analyzer with the icuFolding token filter in an Atlas Search index to perform a diacritic-insensitive search.

    The index:

    {
      "analyzer": "diacriticFolder",
      "mappings": {
        "fields": {
          "name": {
            "type": "document",
            "fields": {
              "pl": {
                "type": "document",
                "fields": {
                  "country": {
                    "analyzer": "diacriticFolder",
                    "type": "string"
                  }
                }
              }
            }
          }
        }
      },
      "analyzers": [
        {
          "name": "diacriticFolder",
          "charFilters": [],
          "tokenizer": {
            "type": "keyword"
          },
          "tokenFilters": [
            {
              "type": "icuFolding"
            }
          ]
        }
      ]
    }
    

    $search query (an aggregation pipeline, for example passed to db.cities_database.aggregate()):

    [
      {
        $search: {
          "text": {
            "query": "Wegry",
            "path": "name.pl.country"
          }
        }
      }
    ]
    

    MongoDB Atlas search playground
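
To make the idea behind the analyzer concrete, here is a minimal plain JavaScript sketch of diacritic folding. It is only an approximation of what the icuFolding filter does, not Atlas's actual implementation: it assumes that Unicode NFD decomposition, stripping combining marks, and lowercasing are enough for illustration. The names foldDiacritics and foldedEquals are hypothetical helpers, not part of any MongoDB API.

```javascript
// Rough stand-in for folding a single keyword token.
// Assumption: NFD + removing combining marks + lowercasing approximates
// the folding closely enough for this example; real ICU folding covers more.
function foldDiacritics(text) {
  return text
    .normalize("NFD")        // split "ę" into "e" + combining ogonek (U+0328)
    .replace(/\p{M}/gu, "")  // drop all combining marks
    .toLowerCase();          // ignore case differences as well
}

// Hypothetical helper: the stored value and the query term are both folded,
// then compared as whole strings (mirroring the keyword tokenizer idea).
function foldedEquals(stored, query) {
  return foldDiacritics(stored) === foldDiacritics(query);
}

console.log(foldDiacritics("Węgry"));           // wegry
console.log(foldedEquals("Węgry", "Wegry"));    // true
console.log(foldedEquals("Węgry", "Hungary"));  // false
```

Because both sides go through the same folding, "Wegry" and "Węgry" end up as the same token. Note that this sketch does not handle letters such as ł, which have no canonical decomposition, so it should not be treated as a replacement for the analyzer.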