javascript, full-text-search, static-site, non-latin, lunrjs

Greek language support for lunr.js


Registering a new stemmer function in lunr for Greek words doesn't work as expected. Here is my code on CodePen. I am not receiving any errors, and the function stemWord() works fine when used on its own, but it fails to stem the words in lunr. Below is a sample of the code:

function stemWord(w) {
// code that returns the stemmed word
};

// create the new function
var greekStemmer = function (token) {
    return stemWord(token);
};

// register it with lunr.Pipeline, this allows you to still serialise the index
lunr.Pipeline.registerFunction(greekStemmer, 'greekStemmer')

var index = lunr(function () {
  this.field('title', {boost: 10})
  this.field('body')
  this.ref('id')

  this.pipeline.remove(lunr.trimmer) // it doesn't work well with non-latin characters
  this.pipeline.add(greekStemmer)
})

index.add({
  id: 1,
  title: 'ΚΑΠΟΙΟΣ',
  body: 'Foo foo foo!'
})

index.add({
  id: 2,
  title: 'ΚΑΠΟΙΕΣ',
  body: 'Bar bar bar!'
})

index.add({
  id: 3,
  title: 'ΤΙΠΟΤΑ',
  body: 'Bar bar bar!'
})

Solution

  • In lunr a stemmer is implemented as a pipeline function. A pipeline function is executed against each word in a document when indexing the document, and each word in a search query when searching.

    For a function to work in a pipeline it has to implement a very simple interface. It needs to accept a single string as input, and it must respond with a string as its output.

    So a very simple (and useless) pipeline function would look like the following:

    var simplePipelineFunction = function (word) {
      return word
    }
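
    To make the "executed against each word" part more concrete, here is a rough sketch of how a chain of pipeline functions could be applied to a list of words. runPipeline and dropShortWords are hypothetical names made up for this illustration; this is not lunr's actual implementation. It also assumes that a function returning undefined drops the word, which, as far as I know, is how lunr's stop word filter removes words.

```javascript
// Simplified stand-in for a pipeline runner; NOT lunr's real code.
function runPipeline(pipelineFunctions, words) {
  return words
    .map(function (word) {
      // pass each word through every function, in order
      return pipelineFunctions.reduce(function (current, fn) {
        // once a function has dropped the word, skip the remaining functions
        return current === undefined ? undefined : fn(current)
      }, word)
    })
    .filter(function (word) {
      // keep only words that survived the whole chain
      return word !== undefined && word !== ''
    })
}

var simplePipelineFunction = function (word) {
  return word
}

// hypothetical second function: drops words of two characters or fewer
var dropShortWords = function (word) {
  return word.length > 2 ? word : undefined
}

var result = runPipeline([simplePipelineFunction, dropShortWords], ['foo', 'to', 'bar'])
// result is ['foo', 'bar']
```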
    

    To actually make use of this pipeline function we need to do two things:

    1. Register it as a pipeline function; this allows lunr to correctly serialise and deserialise your pipeline.
    2. Add it to your index's pipeline.

    That would look something like this:

    // registering our pipeline function with the name 'simplePipelineFunction'
    lunr.Pipeline.registerFunction(simplePipelineFunction, 'simplePipelineFunction')
    
    var idx = lunr(function () {
      // adding the pipeline function to our index's pipeline
      // when defining the index
      this.pipeline.add(simplePipelineFunction)
    })
    

    Now you can take the above and swap out the implementation of our pipeline function. Instead of just returning the word unchanged, it could use the Greek stemmer you have found to stem the word, maybe like this:

    var myGreekStemmer = function (word) {
      // I don't know how to use the Greek stemmer, but I think
      // it's safe to assume it won't be that different from this
      return greekStem(word)
    }
    

    Adapting lunr to work with a language other than English requires more than just adding your stemmer though. The default language of lunr is English, and so, by default, it includes pipeline functions that are specialised for English. English and Greek are different enough that you will probably run into issues trying to index Greek words with the English defaults, so we need to do the following:

    1. Replace the default stemmer with our language-specific stemmer.
    2. Remove the default trimmer, which doesn't play so nicely with non-Latin characters.
    3. Replace or remove the default stop word filter; it's unlikely to be much use on a language other than English.

    The trimmer and stop word filter are implemented as pipeline functions too, so implementing language-specific ones would be similar to implementing the stemmer.
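
    As a rough sketch of what those two functions could look like (not a tested Greek language pack): the stop word list below is a tiny illustrative sample I made up, and the trimmer assumes a JavaScript engine that supports Unicode property escapes in regular expressions. As far as I can tell, the default trimmer relies on the ASCII-only \W class, which is why it eats Greek letters. The filter returns undefined for stop words, on the assumption that lunr drops such tokens.

```javascript
// Illustrative sample only; a usable list would be much longer.
var greekStopWords = ['και', 'το', 'η', 'ο', 'να']

// Strip anything that is not a letter or digit (in any script)
// from the start and end of the token.
var greekTrimmer = function (token) {
  return token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, '')
}

// Return the token unchanged unless it is a stop word,
// in which case nothing (undefined) is returned.
var greekStopWordFilter = function (token) {
  if (greekStopWords.indexOf(token) === -1) return token
}
```

    Both would be registered with lunr.Pipeline.registerFunction in the same way as the stemmer.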

    So, to set up lunr for Greek you would have this:

    var idx = lunr(function () {
      this.pipeline.after(lunr.stemmer, greekStemmer)
      this.pipeline.remove(lunr.stemmer)
    
      this.pipeline.after(lunr.trimmer, greekTrimmer)
      this.pipeline.remove(lunr.trimmer)
    
      this.pipeline.after(lunr.stopWordFilter, greekStopWordFilter)
      this.pipeline.remove(lunr.stopWordFilter)
    
      // define the index as normal
      this.ref('id')
      this.field('title')
      this.field('body')
    })
    

    For some more inspiration, take a look at the excellent lunr-languages project, which has many examples of creating language extensions for lunr. You could even submit one for Greek!

    EDIT: Looks like I don't know the lunr.Pipeline API as well as I thought. There is no replace function; instead, we insert the replacement after the function to be removed, and then remove the original.

    EDIT: Adding this to help others in the future. It turns out the problem was down to the casing of the tokens within lunr. lunr treats all tokens as lowercase; the tokenizer does this with no way to configure it. For most language processing functions this is not a problem; indeed, most assume words are lowercased. In this case, the Greek stemmer only stems uppercase words due to the complexity of stemming in Greek (I'm not a Greek speaker, so I can't comment on how much more complex that stemming is). A solution is to convert each token to uppercase before calling the Greek stemmer, then convert the result back to lowercase before passing it on to the rest of the pipeline.
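
    A minimal sketch of that workaround, assuming tokens arrive as plain strings as in the question's code. The stemWord below is only a toy stand-in (it strips a trailing "ΟΣ" or "ΕΣ") so the snippet runs on its own; swap in the real Greek stemmer.

```javascript
// Toy stand-in for the real Greek stemmer, which expects uppercase input.
// For illustration only: it just strips a trailing "ΟΣ" or "ΕΣ".
function stemWord(w) {
  return w.replace(/(ΟΣ|ΕΣ)$/, '')
}

var greekStemmer = function (token) {
  // lunr hands us lowercased tokens, so uppercase for the stemmer,
  // then lowercase the stem again for the rest of the pipeline
  return stemWord(token.toUpperCase()).toLowerCase()
}
```

    With this wrapper, 'καποιος' and 'καποιες' both come out as 'καποι', so documents 1 and 2 from the question would end up under the same stem.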