I am stemming words with the Porter and Lancaster stemmers, and I made these observations:
Input: replied
Porter: repli
Lancaster: reply

Input: twice
Porter: twice
Lancaster: twic

Input: came
Porter: came
Lancaster: cam

Input: In
Porter: In
Lancaster: in
My questions are:
1. Lancaster was supposed to be an "aggressive" stemmer, but it worked properly with replied. Why?
2. In remained the same in Porter, with the uppercase I intact. Why?
3. Lancaster is removing the final e from words ending with e. Why?
I am not able to understand these concepts. Could you please help?
Lancaster was supposed to be an "aggressive" stemmer, but it worked properly with replied. Why?
It's because the Lancaster stemmer implementation was improved in https://github.com/nltk/nltk/pull/1654
If we take a look at https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L62, there's a suffix rule that changes -ied > -y:
default_rule_tuple = (
    "ai*2.",    # -ia > - if intact
    "a*1.",     # -a > - if intact
    "bb1.",     # -bb > -b
    "city3s.",  # -ytic > -ys
    "ci2>",     # -ic > -
    "cn1t>",    # -nc > -nt
    "dd1.",     # -dd > -d
    "dei3y>",   # -ied > -y
    ...
)
The feature allows users to supply new rules. If no additional rules are given, parseRules falls back to self.default_rule_tuple, and the resulting rule_tuple is applied at https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L196:
def parseRules(self, rule_tuple=None):
    """Validate the set of rules used in this stemmer.
    If this function is called as an individual method, without using stem
    method, rule_tuple argument will be compiled into self.rule_dictionary.
    If this function is called within stem, self._rule_tuple will be used.
    """
    # If there is no argument for the function, use class' own rule tuple.
    rule_tuple = rule_tuple if rule_tuple else self._rule_tuple
    valid_rule = re.compile("^[a-z]+\*?\d[a-z]*[>\.]?$")
    # Empty any old rules from the rule set before adding new ones
    self.rule_dictionary = {}
    for rule in rule_tuple:
        if not valid_rule.match(rule):
            raise ValueError("The rule {0} is invalid".format(rule))
        first_letter = rule[0:1]
        if first_letter in self.rule_dictionary:
            self.rule_dictionary[first_letter].append(rule)
        else:
            self.rule_dictionary[first_letter] = [rule]
The default_rule_tuple actually comes from the whoosh implementation of the Paice-Husk stemmer, which is also known as the Lancaster stemmer: https://github.com/nltk/nltk/pull/1661 =)
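To make the rule notation concrete, here is a minimal sketch of how a single rule string such as "dei3y>" can be read: the ending spelled backwards, an optional * meaning "only if the word is still intact", the number of characters to remove, an optional string to append, and a final > (keep stemming) or . (stop). The helper names parse_rule and apply_rule are made up for illustration and are not part of NLTK; the real stemmer also loops over rules and runs acceptability checks on the result, which this sketch leaves out.

```python
import re

# Reversed ending, optional '*' (intact flag), removal count,
# optional string to append, optional terminator ('>' or '.').
RULE_PATTERN = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.]?)$")


def parse_rule(rule):
    """Split a rule string such as 'dei3y>' into its parts (hypothetical helper)."""
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError("The rule {0} is invalid".format(rule))
    reversed_ending, intact_flag, count, append, terminator = match.groups()
    return {
        "ending": reversed_ending[::-1],  # stored backwards in the rule
        "intact_only": intact_flag == "*",
        "remove": int(count),
        "append": append,
        "continue": terminator == ">",
    }


def apply_rule(word, rule, intact=True):
    """Apply one rule to word; return word unchanged if the rule does not match."""
    parts = parse_rule(rule)
    if not word.endswith(parts["ending"]):
        return word
    if parts["intact_only"] and not intact:
        return word
    # Slice by explicit length so that a removal count of 0 keeps the whole word.
    kept = word[: len(word) - parts["remove"]]
    return kept + parts["append"]


print(apply_rule("replied", "dei3y>"))  # reply
print(apply_rule("came", "e1>"))        # cam
```

Read this way, "dei3y>" says: if the word ends in -ied, drop 3 characters and append y, which is why replied comes out as reply.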
In remained the same in Porter with uppercase In. Why?
This is super interesting! And most probably a bug.
>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('In')
'In'
If we look at the code, the first thing that PorterStemmer.stem() does is lowercase the word, https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L651
def stem(self, word):
    stem = word.lower()
    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]
    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word
    stem = self._step1a(stem)
    stem = self._step1b(stem)
    stem = self._step1c(stem)
    stem = self._step2(stem)
    stem = self._step3(stem)
    stem = self._step4(stem)
    stem = self._step5a(stem)
    stem = self._step5b(stem)
    return stem
But if we look at the code, every other path returns stem, which is lowercased, while two if clauses return some form of the original word, which hasn't been lowercased!!!
if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
    return self.pool[word]
if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
    # With this line, strings of length 1 or 2 don't go through
    # the stemming process, although no mention is made of this
    # in the published algorithm.
    return word
The first if clause checks whether the word is inside self.pool, which contains the irregular words and their stems.
The second checks whether len(word) <= 2 and, if so, returns the word in its original form. In the case of "In", the second condition is True, so the original, non-lowercased form is returned.
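The control flow described above can be reproduced in a few lines. This is only a toy sketch of the early-return pattern, not NLTK's code: buggy_stem and fixed_stem are hypothetical names, and the actual stemming steps are left out because they don't matter for the casing issue.

```python
def buggy_stem(word):
    """Mimic the early return: the length check hands back the original word."""
    stem = word.lower()
    if len(word) <= 2:
        return word  # original casing leaks out here
    return stem  # the real stemmer would run its steps on `stem` first


def fixed_stem(word):
    """Same flow, but the early return uses the lowercased form."""
    stem = word.lower()
    if len(stem) <= 2:
        return stem
    return stem


print(buggy_stem("In"))  # In
print(fixed_stem("In"))  # in
```

Until the library itself behaves like the second function, lowercasing the input yourself before calling the stemmer has the same effect for short words.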
Lancaster is removing the e in "came". Why?
Not surprisingly, this also comes from the default_rule_tuple (https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L67): there's a rule that changes -e > - =)
How can I remove the -e > - rule from default_rule_tuple?
(Un-)fortunately, the LancasterStemmer._rule_tuple object is an immutable tuple, so we can't simply remove one item from it, but we can override it =)
>>> from nltk.stem import LancasterStemmer
>>> lancaster = LancasterStemmer()
>>> lancaster.stem('came')
'cam'
# Create a new stemmer object to refresh the cache.
>>> lancaster = LancasterStemmer()
# Find the 'e1>' rule.
>>> lancaster._rule_tuple.index('e1>')
12
# Create a temporary rule list from the tuple.
>>> temp_rule_list = list(lancaster._rule_tuple)
# Remove the rule.
>>> temp_rule_list.pop(12)
'e1>'
# Override the `._rule_tuple` variable.
>>> lancaster._rule_tuple = tuple(temp_rule_list)
# Et voila!
>>> lancaster.stem('came')
'came'
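If you'd rather not hardcode the index 12, the same tuple surgery can be wrapped in a small helper. This is a sketch only: without_rule is a made-up name, and sample_rules is a tiny hand-picked subset of rules used for illustration, not the full default tuple.

```python
def without_rule(rule_tuple, rule):
    """Return a copy of rule_tuple with the given rule removed (hypothetical helper)."""
    if rule not in rule_tuple:
        raise ValueError("The rule {0} is not in the tuple".format(rule))
    return tuple(r for r in rule_tuple if r != rule)


# A small illustrative subset, not the real default_rule_tuple.
sample_rules = ("dd1.", "dei3y>", "e1>")
print(without_rule(sample_rules, "e1>"))  # ('dd1.', 'dei3y>')
```

The result can then be assigned to lancaster._rule_tuple on a fresh stemmer object, exactly as in the session above.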