I'm trying to preprocess words to remove common prefixes like "un" and "re"; however, all of NLTK's common stemmers seem to ignore prefixes completely:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
PorterStemmer().stem('unhappy')
# u'unhappi'
SnowballStemmer('english').stem('unhappy')
# u'unhappi'
LancasterStemmer().stem('unhappy')
# 'unhappy'
PorterStemmer().stem('reactivate')
# u'reactiv'
SnowballStemmer('english').stem('reactivate')
# u'reactiv'
LancasterStemmer().stem('reactivate')
# 'react'
Isn't removing common prefixes, as well as suffixes, part of a stemmer's job? Is there another stemmer that does this reliably?
You're right: most stemmers only strip suffixes. In fact, the original paper by Martin Porter is titled:
Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.
And possibly the only stemmers in NLTK that do prefix stemming are the Arabic stemmers.
But if we take a look at their prefix_replace function, it simply removes the old prefix and substitutes it with the new prefix:
def prefix_replace(original, old, new):
    """
    Replaces the old prefix of the original string with a new prefix
    :param original: string
    :param old: string
    :param new: string
    :return: string
    """
    return new + original[len(old):]
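To see the behaviour concretely, here is a quick sanity check of prefix_replace; note that it is a blind string operation that never verifies that the word actually starts with the given prefix:

```python
def prefix_replace(original, old, new):
    """Replace the old prefix of the original string with a new prefix."""
    return new + original[len(old):]

# Works as expected when `old` really is a prefix of `original`:
print(prefix_replace("unhappy", "un", ""))    # happy
print(prefix_replace("unhappy", "un", "in"))  # inhappy

# But it blindly chops off len(old) characters, even when the prefix is absent:
print(prefix_replace("happy", "un", ""))      # ppy
```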
But we can do better!
First, do you have a fixed list of prefixes and substitutions for the language you need to process?
Let's go with the (unfortunately) de facto language, English, and do some linguistic work to find out the prefixes in English:
https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
Without much work, you can write a prefix-stemming function that runs before NLTK's suffix stemming, e.g.
import re

from nltk.stem import PorterStemmer

# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
    "anti": "",   # e.g. anti-government, anti-racist, anti-war
    "auto": "",   # e.g. autobiography, automobile
    "de": "",     # e.g. de-classify, decontaminate, demotivate
    "dis": "",    # e.g. disagree, displeasure, disqualify
    "down": "",   # e.g. downgrade, downhearted
    "extra": "",  # e.g. extraordinary, extraterrestrial
    "hyper": "",  # e.g. hyperactive, hypertension
    "il": "",     # e.g. illegal
    "im": "",     # e.g. impossible
    "in": "",     # e.g. insecure
    "ir": "",     # e.g. irregular
    "inter": "",  # e.g. interactive, international
    "mega": "",   # e.g. megabyte, mega-deal, megaton
    "mid": "",    # e.g. midday, midnight, mid-October
    "mis": "",    # e.g. misaligned, mislead, misspelt
    "non": "",    # e.g. non-payment, non-smoking
    "over": "",   # e.g. overcook, overcharge, overrate
    "out": "",    # e.g. outdo, out-perform, outrun
    "post": "",   # e.g. post-election, post-war
    "pre": "",    # e.g. prehistoric, pre-war
    "pro": "",    # e.g. pro-communist, pro-democracy
    "re": "",     # e.g. reconsider, redo, rewrite
    "semi": "",   # e.g. semicircle, semi-retired
    "sub": "",    # e.g. submarine, sub-Saharan
    "super": "",  # e.g. super-hero, supermodel
    "tele": "",   # e.g. television, telepathic
    "trans": "",  # e.g. transatlantic, transfer
    "ultra": "",  # e.g. ultra-compact, ultrasound
    "un": "",     # e.g. unhappy, unusual
    "under": "",  # e.g. under-cook, underestimate
    "up": "",     # e.g. upgrade, uphill
}

porter = PorterStemmer()

def stem_prefix(word, prefixes):
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the no. of substitutions made.
        # Anchor the pattern at the start of the word and
        # allow a dash between the prefix and the root.
        word, nsub = re.subn("^{}-?".format(prefix), "", word)
        if nsub > 0:
            return word
    return word  # No prefix matched; return the word unchanged.

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes))

word = "extraordinary"
porter_english_plus(word)
# 'ordinari'
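As a sanity check against the question's examples, the prefix-stripping step can be exercised in isolation. The sketch below re-declares a minimal strip_prefix helper (a hypothetical name, with just two prefixes for brevity), using a pattern anchored at the start of the word so it can never fire mid-word:

```python
import re

def strip_prefix(word, prefixes):
    """Strip the first matching prefix (longest first), allowing a dash after it."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        stripped, nsub = re.subn("^{}-?".format(prefix), "", word)
        if nsub > 0:
            return stripped
    return word  # no prefix matched

prefixes = {"un", "re"}  # small subset of english_prefixes, for brevity
print(strip_prefix("unhappy", prefixes))     # happy
print(strip_prefix("reactivate", prefixes))  # activate
print(strip_prefix("happy", prefixes))       # happy (unchanged)
```

Porter then handles the suffix side of these roots ("happy" to "happi", "activate" to "activ"), mirroring the suffix-only behaviour shown in the question.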
Now that we have a simplistic prefix stemmer, can we do better?
# E.g. this is not satisfactory:
>>> stem_prefix("united", english_prefixes)
"ited"
What if we check whether the prefix-stemmed word appears in a wordlist before accepting it?
import re

from nltk.corpus import words
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

# Same english_prefixes dict as before, from
# https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
    "anti": "", "auto": "", "de": "", "dis": "", "down": "", "extra": "",
    "hyper": "", "il": "", "im": "", "in": "", "ir": "", "inter": "",
    "mega": "", "mid": "", "mis": "", "non": "", "over": "", "out": "",
    "post": "", "pre": "", "pro": "", "re": "", "semi": "", "sub": "",
    "super": "", "tele": "", "trans": "", "ultra": "", "un": "",
    "under": "", "up": "",
}

porter = PorterStemmer()

whitelist = list(wn.words()) + words.words()

def stem_prefix(word, prefixes, roots):
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the no. of substitutions made.
        # Anchor the pattern at the start of the word and
        # allow a dash between the prefix and the root.
        candidate, nsub = re.subn("^{}-?".format(prefix), "", word)
        # Only accept the stripped form if it is a known root.
        if nsub > 0 and candidate in roots:
            return candidate
    return word

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes, whitelist))
This resolves the issue of stripping away a prefix and leaving a senseless root, e.g.
>>> stem_prefix("united", english_prefixes, whitelist)
"united"
But the Porter stemmer will still remove the suffix -ed, which may or may not be the desired output, especially when the goal is to retain linguistically sound units in the data:
>>> porter_english_plus("united")
"unit"
So, depending on the task, it is sometimes more beneficial to use a lemmatizer rather than a stemmer.
See also: