I'm trying to understand the Snowball stemming algorithm. The algorithm uses two regions, R1 and R2, that are defined as follows:
R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.
R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if there is no such non-vowel.
Examples are
b   e   a   u   t   i   f   u   l
                  |<------------->|    R1
                          |<----->|    R2
b   e   a   u   t   y
                  |<->|    R1
                    ->|<-    R2
a   n   i   m   a   d   v   e   r   s   i   o   n
      |<----------------------------------------->|    R1
              |<--------------------------------->|    R2
s   p   r   i   n   k   l   e   d
                  |<------------->|    R1
                                ->|<-    R2
e   u   c   h   a   r   i   s   t
          |<--------------------->|    R1
                      |<--------->|    R2
My question is: why are "kled" in sprinkled and "harist" in eucharist defined as R1? I thought the correct results would be "inkled" and "arist".
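You should read the definition again. It says:
R1 is the region after the first non-vowel following a vowel.
It says "following a vowel", not "followed by a vowel": the non-vowel comes after a vowel, and R1 starts after that non-vowel.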
In sprinkled, the first non-vowel following a vowel is n, so the region after it is kled.
The same goes for eucharist: the first non-vowel following a vowel is c, so the region after it is harist.
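If it helps, here is a minimal Python sketch of the definition as quoted above. It is not the Snowball source: the function names are made up for illustration, and the vowel set a, e, i, o, u, y is an assumption based on the English stemmer. The real stemmer also has special cases that this ignores.

```python
# Assumption: vowel set of the English Snowball stemmer.
VOWELS = set("aeiouy")

def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    looking only at word[start:]. len(word) means the null region."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def r1_r2(word):
    """Return the R1 and R2 regions of word as strings."""
    r1 = region_start(word)
    r2 = region_start(word, r1)  # same rule, applied inside R1
    return word[r1:], word[r2:]

for w in ["beautiful", "beauty", "animadversion", "sprinkled", "eucharist"]:
    print(w, r1_r2(w))
```

For sprinkled the scan stops at n (the first non-vowel whose previous letter is a vowel), so R1 is everything after n, which is kled.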