elasticsearch, stemming, non-english

Tweak ES documents with non-Latin text before they are stemmed?


I use the word "document" here in the sense of "Lucene Document" or LDoc, i.e. the thing which gets put into the index, analysed, etc.

I'm parsing and then indexing a whole load of .docx and .docm text files in a directory tree. To do that I'm dividing them up into blocks of 10 paragraphs (overlapping). Each 10-paragraph block constitutes an LDoc. I'm creating the index using the _bulk endpoint.
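
For context, the bulk step amounts to something like the following sketch (the URL, index name and field name are placeholders here, not the real ones, and the exact client code isn't the point):

    import json
    import requests

    def bulk_index(blocks, es_url='http://localhost:9200', index='docx_blocks'):
        # One action line + one document line per 10-paragraph block,
        # newline-delimited, which is what the _bulk endpoint expects.
        lines = []
        for block in blocks:
            lines.append(json.dumps({'index': {'_index': index}}))
            lines.append(json.dumps({'text': block}))  # placeholder field name
        body = '\n'.join(lines) + '\n'
        resp = requests.post(f'{es_url}/_bulk',
                             data=body.encode('utf-8'),
                             headers={'Content-Type': 'application/x-ndjson'})
        resp.raise_for_status()
        return resp.json()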

There's quite a lot of non-English text here. In a later stage I shall attempt to use a language-analyser module to try to identify non-English languages written in Latin script. At the moment I'm scratching my head over how to handle LDocs where the string to be indexed contains Greek script.

So one such LDoc text is like this:

"After the loyal things happened pledge was taken, said Klearkhos" As 
soon as the pledge was taken, Clearchus spoke:
--ἄγε δή, ὦ Ἀριαῖε, ἐπείπερ ὁ αὐτὸς ὑμῖν στόλος ἐστὶ καὶ ἡμῖν, εἰπὲ τίνα 
γνώμην ἔχεις περὶ τῆς πορείας, πότερον ἄπιμεν ἥνπερ ἤλθομεν ἢ ἄλλην τινὰ
ἐννενοηκέναι δοκεῖς ὁδὸν κρείττω. ἄγω ἄγε: 2s pres. act. imperative "command!" 
ἄγε interjection: come on; let's go; ἄγε δή: "so" {seemingly} ἐπείπερ conj.:
"seeing that" στόλος: expedition; army; fleet; troop γνώμη: sign; mark; mind;
intelligence; judgment; understanding; will; opinion ἔχεις: 2s pai περὶ prep.:
(+gen.) about; concerning; because of

Examining the results of an (English) stemmer query against an (English) stemmed version of the field, I find this returned for a search on "Klearkhos":

loyal things happened pledge was taken, said <span style=\"background-color:
yellow\">Klearkhos</span>\"\nAs soon as the pledge was taken, Clearchus spoke:

(NB I'm using a highlighter, hence the span)

At first I thought that the stemmer, on encountering non-Latin text, might simply have hung up the phone and decided the rest of the LDoc text isn't worth bothering with. (NB I'm not clear why the beginning, |"After the |, hasn't been included...).

Actually it turns out that it isn't doing that. A search on "intelligence judgment expedition" returns results including this:

that expedition; army; fleet; troop: sign; mark; mind; intelligence; judgment;
understanding;

(highlighting tags omitted...)

So in fact the stemmer function appears to be dividing the submitted text into a number of different, very small, LDocs. This probably isn't the ideal way to handle these LDoc texts.

I think the best thing is probably to strip out the Greek script and just stem the remaining English. But I want the _source field to contain the whole text, regardless.

I can strip out the Greek text in my (Rust) module by detecting non-Latin characters. But how can I tell the ES server to use, for stemming purposes, a different text from the one submitted for the "whole text"?

PS naturally I'd then think about stripping out all the English and stemming all the Greek text in a given LDoc using a Greek stemmer...


Solution

  • I made an incorrect assumption in the question: what's returned in the highlighter results is a set of "highlighting fragments", produced by the highlighting functionality, so there is no reason to believe that the submitted text fields are being split up during analysis/stemming.

    For greater clarity, it is possible to have the highlighter return the whole field rather than fragments:

    data = \
    {
        'query': {
            ...
        },
        'highlight': {
            'number_of_fragments': 0,   # 0 = return the whole field content, highlighted, instead of fragments
            'fields': {
                ...
            }
        }
    }

    This then led me to use two fields:

    1. the unmodified text
    2. the normalised text, i.e. with all accents stripped out, whether on Greek or Latin characters, using unicode_normalization.

    The two stemmer fields (one with an English "analyzer", the other with a Greek "analyzer") are attached to the normalised text field.
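
    For illustration, a sketch of the kind of mapping this implies (index, field and sub-field names are placeholders rather than the real ones; the term_vector settings are there only because the FVH needs positions and offsets; "english" and "greek" are Elasticsearch's built-in analyzers):

    import requests

    mapping = {
        'mappings': {
            'properties': {
                # Unmodified text, so _source shows exactly what was parsed.
                'text': {'type': 'text'},
                # Accent-stripped copy built client-side; the two stemmed
                # sub-fields hang off this one.
                'text_normalised': {
                    'type': 'text',
                    'fields': {
                        'english': {
                            'type': 'text',
                            'analyzer': 'english',
                            # FVH highlighting needs positions and offsets.
                            'term_vector': 'with_positions_offsets',
                        },
                        'greek': {
                            'type': 'text',
                            'analyzer': 'greek',
                            'term_vector': 'with_positions_offsets',
                        },
                    },
                },
            }
        }
    }
    requests.put('http://localhost:9200/my_index', json=mapping).raise_for_status()

    Queries and highlighting then go against text_normalised.english or text_normalised.greek, while _source still carries the untouched text.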

    That leaves a technical problem: the un-analysed (non-normalised) source text can't itself be highlighted ... but there's a way round that: pick apart the highlighted stemmed field and re-apply the tags to the unmodified text. Something like this should work most of the time, if the pre-tags and post-tags for the FVH (fast vector highlighter) are <span ...> and </span>:

    import re

    # Context: highlighted_result is the highlighted (normalised) field text
    # returned by ES, unnormalised_result is the untouched source text (assumed
    # to be the same length), and logger is the module's usual logging.Logger.
    highlighted_unnormalised_str = ''
    matching_iter = re.finditer(r'<span .*?>|</span>', highlighted_result)
    previous_pos = 0       # read position in the highlighted (tagged) string
    unnormalised_pos = 0   # corresponding position in the unnormalised string
    for match in matching_iter:
        # Copy the plain text between the previous tag and this one from the
        # unnormalised string, then re-attach the tag itself.
        section_length = match.start() - previous_pos
        section_from_unnormalised = unnormalised_result[unnormalised_pos: (unnormalised_pos + section_length)]
        highlighted_unnormalised_str += section_from_unnormalised + match.group()
        unnormalised_pos = unnormalised_pos + section_length
        previous_pos = match.end()
    # Whatever follows the last tag.
    final_section_from_unnormalised = unnormalised_result[unnormalised_pos:]
    highlighted_unnormalised_str += final_section_from_unnormalised
    logger.info(f'highlighted_unnormalised_str\n|{highlighted_unnormalised_str}|')
    

    Sometimes this may not work: I suspect that with some languages, normalising the text may result in a shorter string (Unicode is complicated), which would throw the positions off. But so far this seems pretty reliable for Latin and Greek scripts.

    PS the above necessarily implies that all query strings are also normalised before being submitted...
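
    For what it's worth, a minimal Python sketch of that kind of accent stripping (the indexing side here uses the Rust unicode_normalization crate, but the idea is the same: decompose, drop the combining marks, recompose):

    import unicodedata

    def strip_accents(text: str) -> str:
        # NFD splits accented characters into base char + combining marks;
        # the marks are dropped, then NFC recomposes what is left.
        decomposed = unicodedata.normalize('NFD', text)
        stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
        return unicodedata.normalize('NFC', stripped)

    # strip_accents('ἄγε δή, ὦ Ἀριαῖε') -> 'αγε δη, ω Αριαιε'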

    Normalisation will be desirable 99% of the time... but there may be other language situations where diacritics/accents are felt to be significant for searching purposes.