pythonunicodeyqllinguisticsvespa

Vespa indexing anomaly on `exact`-indexed field with diacritical variants and non-latin Scripts


I’m using the Vespa Python client (pyvespa 0.54.0) to query a Vespa index, and I’m running into an issue where Vespa doesn't find a document it has just returned in a previous query.

I have this field in my toponym schema, indexed with match { exact }:

field name_strict type string {
    indexing: attribute | summary
    match { exact }
}

It has to handle place names in a wide variety of languages and scripts, and so before feeding the values are sanitised like this:

    # Normalise Unicode (NFC to prevent decomposition issues)
    toponym = unicodedata.normalize("NFC", toponym)
    # Remove problematic invisible Unicode characters
    toponym = toponym.translate(dict.fromkeys([0x200B, 0x2060]))
    # Ensure UTF-8 encoding (ignore errors)
    toponym = toponym.encode("utf-8", "ignore").decode("utf-8")

A visit after feeding a large and varied dataset confirms that values such as "ශ්‍රී ලංකාව", "បង់ក្លាដែស្ស", and "İslandiya" have been properly indexed, and most documents can successfully be retrieved like this:

    place_name="ශ්‍රී ලංකාව"
    name_strict = QueryField("name_strict")
    q = (
        qb.select(["*"])
        .from_("toponym")
        .where(name_strict.contains(place_name))
    )
    response = vespa_app.query(yql=q)

However, the process fails repeatedly (over multiple re-indexing attempts) with certain place names, such as "İslandiya", as illustrated by the following code which confirms the document's existence both before and after the query:

    >>> from vespa.application import Vespa
    >>> import vespa.querybuilder as qb
    >>> from vespa.querybuilder import QueryField
    >>> vespa_app = Vespa(url="http://vespa-feed.vespa.svc.cluster.local:8080")
    >>> 
    >>> response = vespa_app.query(body={"yql": "select * from toponym where is_staging = true limit 1"})
    >>> place_name = response.json['root']['children'][0]['fields']['name_strict']
    >>> print(place_name)
    İslandiya
    >>> 
    >>> name_strict = QueryField("name_strict")
    >>> q = (
    ...     qb.select(["*"])
    ...     .from_("toponym")
    ...     .where(name_strict.contains(place_name))
    ... )
    >>> response = vespa_app.query(yql=q)
    >>> print(response.json)
    {'root': {'id': 'toplevel', 'relevance': 1.0, 'fields': {'totalCount': 0}, 'coverage': {'coverage': 100, 'documents': 22084, 'full': True, 'nodes': 1, 'results': 1, 'resultsFull': 1}}}
    >>> 
    >>> response = vespa_app.query(body={"yql": "select * from toponym where is_staging = true limit 1"})
    >>> place_name = response.json['root']['children'][0]['fields']['name_strict']
    >>> print(place_name)
    İslandiya
    >>> 

From a sample of 20k place names, these are the ones that fail: İslandiya, İzlanda, İrlanda, İrlandiya, İtaliya, İtalya, ᎶᎹᏂᏯ, İspaniya, İspanya, İsveç, İsveç, ᏓᎶᏂᎨᏍᏛ, İndoneziya, İsrail, İzrail, ᏣᏩᏂᏏ, Ꙗпѡнїꙗ, İvori Sahili. For good measure, I've tried sanitising them (as in the feed process) before using them in the query, but this makes no difference.

Vespa's linguistics module is not supposed to be invoked because the field does not have the index indexing mode, but perhaps that is interfering in some way?

Can anyone please tell me what I might be doing incorrectly? Or is there a workaround?


Solution

  • You're in luck, this is a case-folding problem that's been fixed recently. I could reproduce your problem on vespa 8.485.42 but it works as expected in 8.492.15.