
Stemmers vs Lemmatizers


Natural Language Processing (NLP), especially for English, has reached a stage where stemming would become an archaic technology if a "perfect" lemmatizer existed. That is because stemmers change the surface form of a word/token into meaningless stems.

Then again, the definition of a "perfect" lemmatizer is questionable, because different NLP tasks require different levels of lemmatization, e.g. converting words between verb/noun/adjective forms.

Stemmers

[in]: having
[out]: hav

Lemmatizers

[in]: having
[out]: have
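The contrast above can be sketched in a few lines of Python. The suffix rules and the lookup table here are invented purely for illustration; real stemmers (Porter, Lancaster) and lexicon-based lemmatizers are far more elaborate.

```python
def toy_stem(word):
    """Blindly strip common suffixes; the result need not be a real word."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer maps surface forms to dictionary headwords (lemmas),
# typically via a lexicon rather than pure suffix stripping.
TOY_LEXICON = {"having": "have", "has": "have", "had": "have",
               "drove": "drive", "driving": "drive"}

def toy_lemmatize(word):
    return TOY_LEXICON.get(word, word)

print(toy_stem("having"))       # hav  (not a word, but a consistent stem)
print(toy_lemmatize("having"))  # have (a real dictionary headword)
```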

Solution

  • Q1: "[..] are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English"

    Yes. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications, their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction in Information Retrieval. You replace all drive/driving with driv in both the searched documents and the query. You do not care if it is drive or driv or x17a$ as long as it clusters inflectionally related words together.
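A minimal sketch of the retrieval point above: all that matters is that inflectionally related words collapse to the same key, not that the key is a real word. The stemmer, the documents, and the query here are invented for illustration.

```python
def toy_stem(word):
    # Crude suffix stripping; "drive", "drives", "driving" all become "driv".
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

docs = {1: "she was driving home", 2: "he drives a truck", 3: "a red bicycle"}

# Index every document under the stemmed form of each token.
index = {}
for doc_id, text in docs.items():
    for token in text.split():
        index.setdefault(toy_stem(token), set()).add(doc_id)

# The query is stemmed the same way, so it matches both inflected forms.
query = "drive"
print(sorted(index.get(toy_stem(query), set())))  # [1, 2]
```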

    Q2: "[..] how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify, and adverbify preprocesses?"

    What is your definition of a lemma, does it include derivation (drive - driver) or only inflection (drive - drives - drove)? Does it take into account semantics?

    If you want to include derivation (which most people would say includes verbing nouns etc.), then keep in mind that derivation is far more irregular than inflection. There are many idiosyncrasies, gaps, etc. Do you really want change (change trains) and change (as in coins) to have the same lemma? If not, where do you draw the boundary? How about nerve - unnerve, earth - unearth - earthling, ... It really depends on the application.

    If you take into account semantics (bank would be labeled as bank-money or bank-river depending on context), how deep do you go (do you distinguish bank-institution from bank-building)? Some apps may not care about this at all, some might want to distinguish basic semantics, and some might want it fine-grained.

    Q3: "How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?"

    What do you mean by "similar morphological structures as English"? English has very little inflectional morphology. There are good lemmatizers for languages of other morphological types (truly inflectional, agglutinative, template, ...).

    With the possible exception of agglutinative languages, I would argue that a lookup table (say a compressed trie) is the best solution (possibly with some backup rules for unknown words such as proper names). The lookup is followed by some kind of disambiguation, ranging from trivial (take the first one, or take the first one consistent with the word's POS tag) to much more sophisticated. The more sophisticated disambiguations are usually supervised stochastic algorithms (e.g. TreeTagger or Faster), although a combination of machine learning and manually created rules has been done too (see e.g. this).
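The lookup-plus-disambiguation scheme can be sketched as below. The entries, tag names, and backup rule are invented for this example; a real table would be far larger and stored compactly (e.g. as a compressed trie).

```python
# form -> ordered list of candidate (lemma, POS) analyses
LOOKUP = {
    "saw":  [("see", "VERB"), ("saw", "NOUN")],
    "left": [("leave", "VERB"), ("left", "ADJ")],
    "dogs": [("dog", "NOUN")],
}

def lemmatize(form, pos_tag=None):
    candidates = LOOKUP.get(form)
    if candidates is None:
        # Backup rule for unknown words (e.g. proper names): return as-is.
        return form
    if pos_tag is not None:
        # Trivial disambiguation: first analysis consistent with the POS tag.
        for lemma, pos in candidates:
            if pos == pos_tag:
                return lemma
    # Even more trivial: just take the first analysis.
    return candidates[0][0]

print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
print(lemmatize("Alice"))        # Alice
```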

    Obviously, for most languages, you do not want to create the lookup table by hand, but instead, generate it from a description of the morphology of that language. For inflectional languages, you can go the engineering way of Hajic for Czech or Mikheev for Russian, or, if you are daring, you use two-level morphology. Or you can do something in between, such as Hana (myself) (Note that these are all full morphological analyzers that include lemmatization as one of their features). Or you can learn the lemmatizer in an unsupervised manner a la Yarowsky and Wicentowski, possibly with manual post-processing, correcting the most frequent words.
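The idea of generating the table from a description of the morphology, rather than by hand, can be sketched very roughly as follows. The paradigm format and the sample lexicon are invented; real morphological descriptions (Hajic-style, two-level morphology) handle stem alternations, exceptions, and much more.

```python
# Each paradigm maps an inflectional slot to a suffix attached to the stem.
PARADIGMS = {
    "regular_verb": {"base": "", "3sg": "s", "past": "ed", "prog": "ing"},
}

# Lexicon: (stem, paradigm name) pairs.
LEXICON = [("walk", "regular_verb"), ("talk", "regular_verb")]

def build_lookup(lexicon, paradigms):
    """Expand every lexicon entry into (surface form -> lemmas) entries."""
    table = {}
    for stem, paradigm in lexicon:
        for slot, suffix in paradigms[paradigm].items():
            table.setdefault(stem + suffix, set()).add(stem)
    return table

table = build_lookup(LEXICON, PARADIGMS)
print(sorted(table["walked"]))   # ['walk']
print(sorted(table["talking"]))  # ['talk']
```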

    There are way too many options and it really all depends on what you want to do with the results.