Tags: nlp, nltk, stemming, porter-stemmer, nltk-book

Porter and Lancaster stemming clarification


I am stemming words with the Porter and Lancaster stemmers and see the following results:

Input: replied
Porter: repli
Lancaster: reply


Input: twice
Porter: twice
Lancaster: twic

Input: came
Porter: came
Lancaster: cam

Input: In
Porter: In
Lancaster: in

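My questions are:

1. Lancaster is supposed to be an "aggressive" stemmer, but it worked properly with replied. Why?
2. The word In stayed the same in Porter, uppercase I included. Why?
3. Lancaster removes the final e from "came". Why?

I am not able to understand these results. Could you please help?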


Solution

  • Q: Lancaster is supposed to be an "aggressive" stemmer, but it worked properly with replied. Why?

    This is because the Lancaster stemmer implementation was improved in https://github.com/nltk/nltk/pull/1654

    If we look at https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L62, there is a suffix rule that changes -ied > -y:

    default_rule_tuple = (
        "ai*2.",   # -ia > -   if intact
        "a*1.",    # -a > -    if intact
        "bb1.",    # -bb > -b
        "city3s.", # -ytic > -ys
        "ci2>",    # -ic > -
        "cn1t>",   # -nc > -nt
        "dd1.",    # -dd > -d
        "dei3y>",  # -ied > -y
        ...)
    

    That pull request lets users supply their own rules. If no rules are supplied, the stemmer uses self.default_rule_tuple. The rule tuple is validated and loaded into self.rule_dictionary in parseRules, https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L196

    def parseRules(self, rule_tuple=None):
        """Validate the set of rules used in this stemmer.
        If this function is called as an individual method, without using stem
        method, rule_tuple argument will be compiled into self.rule_dictionary.
        If this function is called within stem, self._rule_tuple will be used.
        """
        # If there is no argument for the function, use class' own rule tuple.
        rule_tuple = rule_tuple if rule_tuple else self._rule_tuple
        valid_rule = re.compile("^[a-z]+\*?\d[a-z]*[>\.]?$")
        # Empty any old rules from the rule set before adding new ones
        self.rule_dictionary = {}
    
        for rule in rule_tuple:
            if not valid_rule.match(rule):
                raise ValueError("The rule {0} is invalid".format(rule))
            first_letter = rule[0:1]
            if first_letter in self.rule_dictionary:
                self.rule_dictionary[first_letter].append(rule)
            else:
                self.rule_dictionary[first_letter] = [rule]
    

    The default_rule_tuple actually comes from the whoosh implementation of the Paice-Husk stemmer, also known as the Lancaster stemmer, https://github.com/nltk/nltk/pull/1661 =)
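
To make the rule notation concrete, here is a minimal sketch that applies a single rule string such as "dei3y>" to a word. It assumes the layout suggested by the comments in default_rule_tuple: the ending written backwards, an optional * for "intact", the number of letters to remove, the letters to append, and a final > or . marker. The name apply_rule is hypothetical and not part of NLTK, and the sketch ignores the intact flag, the continue/stop marker, and any other checks the real stemmer performs.

```python
import re

# Assumed rule layout: reversed ending, optional '*', digit, letters to append,
# then '>' or '.'. Only the ending, digit, and append string are used below.
RULE_PATTERN = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.]?)$")


def apply_rule(word, rule):
    """Apply one rule to `word`; return `word` unchanged if the ending differs."""
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError("The rule {0} is invalid".format(rule))
    reversed_ending, _intact, remove_count, append, _marker = match.groups()
    ending = reversed_ending[::-1]  # "dei" is stored backwards, meaning "ied"
    if not word.endswith(ending):
        return word
    keep = len(word) - int(remove_count)  # number of leading letters to keep
    return word[:keep] + append


print(apply_rule("replied", "dei3y>"))  # reply
print(apply_rule("came", "e1>"))        # cam
```

Read this way, "dei3y>" strips the three letters of -ied and appends y, which is how replied ends up as reply.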

    Q: The word In stayed the same in Porter, uppercase I included. Why?

    This is super interesting! And most probably a bug.

    >>> from nltk.stem import PorterStemmer
    >>> porter = PorterStemmer()
    >>> porter.stem('In')
    'In'
    

    If we look at the code, the first thing that PorterStemmer.stem() does is lowercase the word, https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L651

    def stem(self, word):
        stem = word.lower()
    
        if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
            return self.pool[word]
    
        if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
            # With this line, strings of length 1 or 2 don't go through
            # the stemming process, although no mention is made of this
            # in the published algorithm.
            return word
    
        stem = self._step1a(stem)
        stem = self._step1b(stem)
        stem = self._step1c(stem)
        stem = self._step2(stem)
        stem = self._step3(stem)
        stem = self._step4(stem)
        stem = self._step5a(stem)
        stem = self._step5b(stem)
    
        return stem
    

    Every stemming step returns stem, which is lowercased, but there are two if clauses that return some form of the original word that hasn't been lowercased!!!

    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]
    
    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word
    

    The first if clause checks whether the word is in self.pool, which contains the irregular words and their stems.

    The second checks whether len(word) <= 2 and, if so, returns the word in its original form. For "In" this condition is True, so the original non-lowercased form is returned.
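
One way to sidestep the casing issue is to lowercase the word before handing it to the stemmer. The sketch below shows the idea without depending on NLTK: ShortWordStemmer is a hypothetical stand-in that mimics only the early return quoted above, and stem_lowercased is a hypothetical helper, not an NLTK function.

```python
class ShortWordStemmer:
    """Hypothetical stand-in that mimics only the early return quoted above."""

    def stem(self, word):
        stem = word.lower()
        if len(word) <= 2:
            return word  # the original casing leaks through here
        return stem  # a real stemmer would run its suffix-stripping steps here


def stem_lowercased(stemmer, word):
    """Lowercase first, so every return path sees an already lowercased word."""
    return stemmer.stem(word.lower())


stemmer = ShortWordStemmer()
print(stemmer.stem("In"))              # In
print(stem_lowercased(stemmer, "In"))  # in
```

The helper only assumes that the stemmer object has a stem(word) method, so the porter object from the session above could be passed in the same way.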

    Q: Lancaster removes the final e from "came". Why?

    Not surprisingly, this also comes from the default_rule_tuple, https://github.com/nltk/nltk/blob/develop/nltk/stem/lancaster.py#L67, where there is a rule that changes -e > - =)

    Q: How do I disable the -e > - rule from default_rule_tuple?

    (Un-)fortunately, the LancasterStemmer._rule_tuple object is an immutable tuple, so we can't simply remove one item from it, but we can override it =)

    >>> from nltk.stem import LancasterStemmer
    >>> lancaster = LancasterStemmer()
    >>> lancaster.stem('came')
    'cam'
    
    # Create a new stemmer object to refresh the cache.
    >>> lancaster = LancasterStemmer()
    # Find the 'e1>' rule.
    >>> lancaster._rule_tuple.index('e1>')
    12
    
    # Create a temporary rule list from the tuple.
    >>> temp_rule_list = list(lancaster._rule_tuple)
    # Remove the rule.
    >>> temp_rule_list.pop(12)
    'e1>'
    # Override the `._rule_tuple` variable.
    >>> lancaster._rule_tuple = tuple(temp_rule_list)
    
    # Et voila!
    >>> lancaster.stem('came')
    'came'
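
The session above relies on the hard-coded position 12, which would silently remove the wrong rule if the tuple ever changed. A minimal sketch of a position-independent variant follows. The helper name without_rules is hypothetical, and rules_excerpt is just a short illustrative slice of the rules quoted earlier, not the full default tuple.

```python
def without_rules(rule_tuple, unwanted):
    """Return a new tuple with every rule in `unwanted` removed.

    Raises ValueError if a rule to remove is not present, so a typo does not
    go unnoticed.
    """
    unwanted = set(unwanted)
    missing = unwanted - set(rule_tuple)
    if missing:
        raise ValueError("Rules not found: {0}".format(sorted(missing)))
    return tuple(rule for rule in rule_tuple if rule not in unwanted)


# Illustrative excerpt only; the real tuple is much longer.
rules_excerpt = ("dd1.", "dei3y>", "e1>")
print(without_rules(rules_excerpt, ["e1>"]))  # ('dd1.', 'dei3y>')
```

With a fresh stemmer, as in the session above, the result could be assigned the same way: lancaster._rule_tuple = without_rules(lancaster._rule_tuple, ['e1>']).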